Video personalization, which generates customized videos using reference images, has gained significant attention. However, prior methods typically focus on single-concept personalization, limiting broader applications that require multi-concept integration. Attempts to extend these models to multiple concepts often lead to identity blending, which results in composite characters with fused attributes from multiple sources. This challenge arises due to the lack of a mechanism to link each concept with its specific reference image. We address this with anchored prompts, which embed image anchors as unique tokens within text prompts, guiding accurate referencing during generation. Additionally, we introduce concept embeddings to encode the order of reference images. Our approach, Movie Weaver, seamlessly weaves multiple concepts—including face, body, and animal images—into one video, allowing flexible combinations in a single model. The evaluation shows that Movie Weaver outperforms existing methods for multi-concept video personalization in identity preservation and overall quality.
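The two mechanisms can be illustrated with a minimal sketch. Everything here is an illustrative assumption rather than the paper's actual implementation: the anchor-token format `<image-i>`, the function names, and the dimensions are invented for exposition. Anchor tokens tie each concept phrase in the prompt to a specific reference image, and a learned per-slot embedding encodes the order of the reference images.

```python
import numpy as np

MAX_CONCEPTS = 4   # assumed cap on reference images
DIM = 16           # toy embedding dimension

def build_anchored_prompt(prompt, concept_phrases):
    """Insert a unique anchor token <image-i> after each concept phrase,
    linking that phrase to the i-th reference image."""
    for i, phrase in enumerate(concept_phrases, start=1):
        prompt = prompt.replace(phrase, f"{phrase} <image-{i}>", 1)
    return prompt

# Concept embeddings: one vector per reference-image slot (learned in
# practice, random here), added to every token of the i-th reference
# image so the model can tell which image each anchor token refers to.
rng = np.random.default_rng(0)
slot_embedding = rng.normal(size=(MAX_CONCEPTS, DIM))

def add_concept_embeddings(ref_tokens):
    """ref_tokens: (num_refs, seq_len, DIM) array of reference-image tokens."""
    n = ref_tokens.shape[0]
    return ref_tokens + slot_embedding[:n, None, :]

prompt = build_anchored_prompt(
    "a woman and a man dance in the rain", ["a woman", "a man"]
)
# -> "a woman <image-1> and a man <image-2> dance in the rain"
```

With both pieces in place, the anchor token in the text and the slot embedding on the image tokens give the model a consistent index for each concept, which is what prevents attributes from two references being fused into one face.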
Caption: When we naively extend a single-concept video personalization method such as PT2V MovieGen to multiple concepts, we observe a severe identity-blending problem: the model generates composite faces with characteristics from both references.
Caption: (a) Movie Weaver architecture. Compared to the single-concept baseline, reference images are arranged in a specific order for concept embedding, and anchored prompts are utilized. Shared components are omitted for simplicity. (b) Automatic data curation pipeline. For a video-text pair, ① concept descriptions and anchored prompts are generated via in-context learning with Llama-3. After ② extracting body masks, ③ CLIP links each concept to its corresponding image. ④ Finally, face images are obtained using a face segmentation model.
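Step ③ of the pipeline, linking each concept description to its corresponding image via CLIP, can be sketched as a greedy cosine-similarity matching. The helper below takes precomputed CLIP text embeddings (of the concept descriptions) and image embeddings (of the masked body crops) as plain arrays; the function names and greedy strategy are assumptions for illustration, not the paper's actual code.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def link_concepts(desc_embs, crop_embs):
    """Greedily assign each concept description to the not-yet-used
    image crop with the highest cosine similarity.

    desc_embs: CLIP text embeddings of the concept descriptions.
    crop_embs: CLIP image embeddings of the masked body crops.
    Returns a dict {description_index: crop_index}.
    """
    links, used = {}, set()
    for i, d in enumerate(desc_embs):
        _, j = max(
            (cosine(d, c), j) for j, c in enumerate(crop_embs) if j not in used
        )
        links[i] = j
        used.add(j)
    return links

# Toy 2-D example: description 0 matches crop 1 and vice versa.
descs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
crops = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
print(link_concepts(descs, crops))  # {0: 1, 1: 0}
```

The one-to-one constraint (the `used` set) matters: without it, two descriptions could be linked to the same crop, reintroducing exactly the reference ambiguity the pipeline is meant to remove.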
| Case | AP | CE | sep_yes ↑ | face1_sim ↑ | face2_sim ↑ |
|---|---|---|---|---|---|
| Baseline | | | 42.9 | 3.4 | 3.0 |
| (1) | ✔ | | 98.2 | 58.8 | 41.9 |
| (2) | ✔ | ✔ | 99.3 | 66.8 | 66.1 |
Caption: Effect of the proposed Anchored Prompts (AP) and Concept Embeddings (CE), evaluated via a human study. sep_yes is the percentage of cases where the two generated faces are distinguishable (i.e., no identity blending); face1_sim and face2_sim are the percentages of cases where a face similar to the left or right reference face, respectively, appears in the generated video.
You may also want to check the qualitative results, the comparison with Vidu, and the limitations in the supplementary videos.
TBD