Movie Weaver: Tuning-Free Multi-Concept Video Personalization with Anchored Prompts

1The University of Texas at Austin, 2Meta GenAI
*Work partially done during an internship at Meta GenAI.

Abstract

Video personalization, which generates customized videos using reference images, has gained significant attention. However, prior methods typically focus on single-concept personalization, limiting broader applications that require multi-concept integration. Attempts to extend these models to multiple concepts often lead to identity blending, which results in composite characters with fused attributes from multiple sources. This challenge arises due to the lack of a mechanism to link each concept with its specific reference image. We address this with anchored prompts, which embed image anchors as unique tokens within text prompts, guiding accurate referencing during generation. Additionally, we introduce concept embeddings to encode the order of reference images. Our approach, Movie Weaver, seamlessly weaves multiple concepts—including face, body, and animal images—into one video, allowing flexible combinations in a single model. The evaluation shows that Movie Weaver outperforms existing methods for multi-concept video personalization in identity preservation and overall quality.
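To make the two mechanisms concrete, below is a minimal sketch, not the paper's implementation: the names (ANCHOR_TOKENS, build_anchored_prompt, ConceptEmbedding) and the backbone details are assumptions. It illustrates the idea of appending a unique anchor token after each concept description in the prompt, and adding a learned order embedding to the reference-image features so the model knows which reference is which.

```python
import torch
import torch.nn as nn

# Hypothetical anchor tokens; each one ties a concept mention in the prompt
# to the reference image supplied at the same position.
ANCHOR_TOKENS = ["<ref_1>", "<ref_2>", "<ref_3>"]

def build_anchored_prompt(concept_descriptions: list[str], action: str) -> str:
    """Insert a unique anchor token right after each concept description."""
    anchored = [
        f"{desc} {ANCHOR_TOKENS[i]}" for i, desc in enumerate(concept_descriptions)
    ]
    return f"{', and '.join(anchored)} {action}"

class ConceptEmbedding(nn.Module):
    """Learned embedding added to reference-image features to encode their order."""

    def __init__(self, max_concepts: int, dim: int):
        super().__init__()
        self.order_embed = nn.Embedding(max_concepts, dim)

    def forward(self, ref_features: torch.Tensor) -> torch.Tensor:
        # ref_features: (num_concepts, num_tokens, dim)
        order = torch.arange(ref_features.size(0), device=ref_features.device)
        return ref_features + self.order_embed(order)[:, None, :]

prompt = build_anchored_prompt(
    ["a woman with long brown hair", "a man in a blue jacket"],
    "walk along a beach at sunset.",
)
# -> "a woman with long brown hair <ref_1>, and a man in a blue jacket <ref_2> walk along a beach at sunset."
```

Because each anchor token appears exactly once in the prompt and the order embedding marks the matching reference image, the generator has an explicit link between every concept and its own reference, which is what prevents identity blending.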

Identity Blending of Multi-Concept Personalization

Caption: When we naively extend a single-concept video personalization method such as PT2V MovieGen to the multi-concept setting, we observe a severe identity blending problem: the model generates composite faces that mix characteristics from both references.

Movie Weaver

Movie Weaver Architecture

Caption: (a) Movie Weaver architecture. Compared to the single-concept baseline, reference images are arranged in a fixed order for concept embedding, and anchored prompts are used. Shared components are omitted for simplicity. (b) Automatic data curation pipeline. For a video-text pair, ① concept descriptions and anchored prompts are generated via in-context learning with Llama-3. After ② extracting body masks, ③ CLIP links each concept description to its corresponding image. ④ Finally, face images are obtained with a face segmentation model.
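A rough sketch of the four curation steps is shown below. Every callable passed in (llm_rewrite, detect_body_crops, clip_similarity, segment_face) is a hypothetical placeholder standing in for the corresponding component, not an API from the paper or any specific library.

```python
def curate_example(frame, caption, llm_rewrite, detect_body_crops,
                   clip_similarity, segment_face):
    # 1. In-context learning with an LLM (Llama-3 in the paper) rewrites the
    #    caption into per-concept descriptions plus an anchored prompt.
    concept_descriptions, anchored_prompt = llm_rewrite(caption)

    # 2. Extract body masks / crops for each subject in a key frame.
    body_crops = detect_body_crops(frame)

    # 3. Link each concept description to the body crop with the highest
    #    CLIP image-text similarity.
    matched_crops = {}
    for desc in concept_descriptions:
        scores = [clip_similarity(desc, crop) for crop in body_crops]
        matched_crops[desc] = body_crops[scores.index(max(scores))]

    # 4. Run face segmentation on the matched crops to get face reference images.
    face_refs = {desc: segment_face(crop) for desc, crop in matched_crops.items()}

    return anchored_prompt, matched_crops, face_refs
```

The output of this pipeline is exactly the training triplet the architecture in (a) consumes: an anchored prompt plus ordered body and face reference images.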

Results

                Modules       Human Study Metrics
Case          AP    CE    sep_yes ↑   face1_sim ↑   face2_sim ↑
Baseline      –     –     42.9        3.4           3.0
(1)           ✓     –     98.2        58.8          41.9
(2)           ✓     ✓     99.3        66.8          66.1

Caption: Effect of the proposed Anchored Prompts (AP) and Concept Embeddings (CE). The left columns indicate which modules are enabled; the right columns report human study metrics. sep_yes is the percentage of cases in which the two generated faces are distinguishable (i.e., no identity blending); face1_sim and face2_sim are the percentages of cases in which a face similar to the left or right reference face, respectively, appears in the generated video.
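For clarity, here is how such percentages could be aggregated from per-video human annotations. The annotation schema below is an assumption for illustration only; the paper's exact human-study protocol may differ.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One human rating of a generated video (hypothetical schema)."""
    faces_distinguishable: bool  # counts toward sep_yes
    left_face_found: bool        # counts toward face1_sim
    right_face_found: bool       # counts toward face2_sim

def summarize(annotations):
    """Aggregate per-video annotations into the table's percentage metrics."""
    n = len(annotations)

    def pct(flags):
        return round(100.0 * sum(flags) / n, 1)

    return {
        "sep_yes": pct(a.faces_distinguishable for a in annotations),
        "face1_sim": pct(a.left_face_found for a in annotations),
        "face2_sim": pct(a.right_face_found for a in annotations),
    }
```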

BibTeX

TBD