Video personalization, which generates customized videos using reference images, has gained significant attention. However, prior methods typically focus on single-concept personalization, limiting broader applications that require multi-concept integration. Attempts to extend these models to multiple concepts often lead to identity blending, which results in composite characters with fused attributes from multiple sources. This challenge arises due to the lack of a mechanism to link each concept with its specific reference image. We address this with anchored prompts, which embed image anchors as unique tokens within text prompts, guiding accurate referencing during generation. Additionally, we introduce concept embeddings to encode the order of reference images. Our approach, Movie Weaver, seamlessly weaves multiple concepts—including face, body, and animal images—into one video, allowing flexible combinations in a single model. The evaluation shows that Movie Weaver outperforms existing methods for multi-concept video personalization in identity preservation and overall quality.
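The two mechanisms can be illustrated with a minimal sketch. Everything here is an illustrative assumption rather than the paper's actual implementation: the anchor-token format `<image-i>`, the function names, and the dimensions are invented for exposition. Anchor tokens tie each concept phrase in the prompt to a specific reference image, and a learned per-slot embedding encodes the order of the reference images.

```python
import numpy as np

MAX_CONCEPTS = 4   # assumed cap on reference images
DIM = 16           # toy embedding dimension

def build_anchored_prompt(prompt, concept_phrases):
    """Insert a unique anchor token <image-i> after each concept phrase,
    linking that phrase to the i-th reference image."""
    for i, phrase in enumerate(concept_phrases, start=1):
        prompt = prompt.replace(phrase, f"{phrase} <image-{i}>", 1)
    return prompt

# Concept embeddings: one vector per reference-image slot (learned in
# practice, random here), added to every token of the i-th reference
# image so the model can tell which image each anchor token refers to.
rng = np.random.default_rng(0)
slot_embedding = rng.normal(size=(MAX_CONCEPTS, DIM))

def add_concept_embeddings(ref_tokens):
    """ref_tokens: (num_refs, seq_len, DIM) array of reference-image tokens."""
    n = ref_tokens.shape[0]
    return ref_tokens + slot_embedding[:n, None, :]

prompt = build_anchored_prompt(
    "a woman and a man dance in the rain", ["a woman", "a man"]
)
# -> "a woman <image-1> and a man <image-2> dance in the rain"
```

With both pieces in place, the anchor token in the text and the slot embedding on the image tokens give the model a consistent index for each concept, which is what prevents attributes from two references being fused into one face.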
Caption: When we naively extend a single-concept video personalization method such as PT2V MovieGen to multiple concepts, we observe a severe identity-blending problem: the model generates composite faces with characteristics from both references.
Caption: (a) Movie Weaver architecture. Compared to the single-concept baseline, reference images are arranged in a specific order for concept embedding, and anchored prompts are utilized. Shared components are omitted for simplicity. (b) Automatic data curation pipeline. For a video-text pair, ① concept descriptions and anchored prompts are generated via in-context learning with Llama-3. After ② extracting body masks, ③ CLIP links each concept to its corresponding image. ④ Finally, face images are obtained using a face segmentation model.
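Step ③ of the pipeline, linking each concept description to its corresponding image via CLIP, can be sketched as a greedy cosine-similarity matching. The helper below takes precomputed CLIP text embeddings (of the concept descriptions) and image embeddings (of the masked body crops) as plain arrays; the function names and greedy strategy are assumptions for illustration, not the paper's actual code.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def link_concepts(desc_embs, crop_embs):
    """Greedily assign each concept description to the not-yet-used
    image crop with the highest cosine similarity.

    desc_embs: CLIP text embeddings of the concept descriptions.
    crop_embs: CLIP image embeddings of the masked body crops.
    Returns a dict {description_index: crop_index}.
    """
    links, used = {}, set()
    for i, d in enumerate(desc_embs):
        _, j = max(
            (cosine(d, c), j) for j, c in enumerate(crop_embs) if j not in used
        )
        links[i] = j
        used.add(j)
    return links

# Toy 2-D example: description 0 matches crop 1 and vice versa.
descs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
crops = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
print(link_concepts(descs, crops))  # {0: 1, 1: 0}
```

The one-to-one constraint (the `used` set) matters: without it, two descriptions could be linked to the same crop, reintroducing exactly the reference ambiguity the pipeline is meant to remove.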
| Case | AP | CE | sep_yes ↑ | face1_sim ↑ | face2_sim ↑ |
|---|---|---|---|---|---|
| Baseline | | | 42.9 | 3.4 | 3.0 |
| (1) | ✔ | | 98.2 | 58.8 | 41.9 |
| (2) | ✔ | ✔ | 99.3 | 66.8 | 66.1 |
Caption: Effect of the proposed Anchored Prompts (AP) and Concept Embeddings (CE), evaluated via a human study. sep_yes is the percentage of cases where the two generated faces are distinguishable (i.e., no identity blending); face1_sim and face2_sim are the percentages of cases where a face similar to the left or right reference face, respectively, appears in the generated video.
You may also want to check the qualitative results, the comparison with Vidu, and the limitations in the supplementary videos.
TBD