All videos are compressed; we recommend watching them in full screen. Click on a video to view it at full scale.
We compare our StreamV2V to StreamDiffusion [1], CoDeF [2], Rerender [3], and FlowVid [4].
(Fig.5) Edit prompt: "A pixel art of a man doing a handstand on the street."

| Input video | Ours | StreamDiffusion [1] |
|---|---|---|
| CoDeF [2] | Rerender [3] | FlowVid [4] |
(Fig.8) Ablation on extended self-attention (EA) and feature fusion (FF). Edit prompt: "A man is surfing, in animation."
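To make the two ablated components concrete, below is a minimal PyTorch sketch of what extended self-attention over banked per-frame features, plus a similarity-weighted feature fusion, could look like. All names (`extended_self_attention`, `feature_fusion`, `bank_k`, `bank_v`) and the fusion rule are our hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def extended_self_attention(q, k, v, bank_k=None, bank_v=None):
    """Self-attention where the current frame also attends to banked
    keys/values from past frames (hypothetical sketch).

    q, k, v:        [batch, tokens, dim] for the current frame.
    bank_k, bank_v: [batch, bank_tokens, dim] cached from earlier frames.
    """
    if bank_k is not None:
        # Extend keys/values along the token axis with the feature bank.
        k = torch.cat([k, bank_k], dim=1)
        v = torch.cat([v, bank_v], dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v  # [batch, tokens, dim]

def feature_fusion(feat, bank_feat, alpha=0.5):
    """Blend each token with its banked counterpart, weighted by clamped
    cosine similarity -- a hypothetical fusion rule for illustration."""
    sim = F.cosine_similarity(feat, bank_feat, dim=-1).unsqueeze(-1)  # [B, T, 1]
    w = alpha * sim.clamp(min=0.0)
    return (1.0 - w) * feat + w * bank_feat
```

In this sketch, the current frame attends to its own tokens plus tokens cached from earlier frames, which is what ties appearance across frames.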
(Fig.13) Ablation on the number of denoising steps. Using fewer denoising steps accelerates per-frame inference, but we observe a noticeable quality drop with only 1 step.
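For context on what this knob looks like in practice, the sketch below varies `num_inference_steps` in a generic Hugging Face diffusers LCM image-to-image pipeline. This is not StreamV2V's own pipeline; the model and LoRA IDs are common public checkpoints used purely for illustration, and the frame path is a placeholder.

```python
import torch
from PIL import Image
from diffusers import AutoPipelineForImage2Image, LCMScheduler

# Generic LCM img2img setup; model/LoRA IDs are public example
# checkpoints, not StreamV2V's own weights.
pipe = AutoPipelineForImage2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

frame = Image.open("frame_0001.png").convert("RGB")  # placeholder frame path

# Fewer steps -> lower per-frame latency; at 1 step, quality visibly drops.
for num_steps in (4, 2, 1):
    edited = pipe(
        prompt="a man is surfing, in animation",
        image=frame,
        num_inference_steps=num_steps,
        strength=1.0,        # run all scheduled steps on this frame
        guidance_scale=1.0,  # LCM works with little or no CFG
    ).images[0]
```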
Elon Musk
Claymation
(Fig.9) Left: the per-frame LCM baseline shows severe flickering even with a slight one- or two-word prompt modification. Right: our feature bank provides a much smoother transition.
(Fig.15) Long video ($>$ 1000 frames) generation. Our StreamV2V can handle videos of arbitrary length without consistency degradation.
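A bounded feature bank is one simple way such arbitrary-length streaming stays tractable: if the bank has fixed capacity, per-frame cost and memory do not grow with video length. The class below is a hypothetical fixed-capacity bank of our own devising; the paper's actual update rule may differ (e.g., merging stored and new features rather than evicting the oldest frame).

```python
from collections import deque
import torch

class FeatureBank:
    """Fixed-capacity store of attention features from past frames.

    Illustrative sketch only: capping the capacity keeps memory and
    attention cost constant per frame, so a streaming editor can run
    for >1000 frames without slowing down or exhausting memory.
    """

    def __init__(self, capacity: int = 4):
        self.entries = deque(maxlen=capacity)  # oldest frame is evicted

    def update(self, k: torch.Tensor, v: torch.Tensor) -> None:
        """Store this frame's keys/values after it is processed."""
        self.entries.append((k.detach(), v.detach()))

    def get(self):
        """Concatenate banked keys/values along the token dimension."""
        if not self.entries:
            return None, None
        ks, vs = zip(*self.entries)
        return torch.cat(ks, dim=1), torch.cat(vs, dim=1)
```

Per frame, `get()` would supply the banked keys/values to an extended self-attention like the Fig.8 sketch above, and `update()` would insert the new frame's features.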
(Fig.10) Limitations of StreamV2V.

(a) StreamV2V fails to alter the person in the input video into the Pope or Batman.
(b) StreamV2V can produce inconsistent outputs, as seen on the girl for the Anime style and the backpack straps for the Van Gogh style.
[1] Kodaira, Akio, et al. "StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation." arXiv preprint arXiv:2312.12491 (2023).
[2] Ouyang, Hao, et al. "CoDeF: Content Deformation Fields for Temporally Consistent Video Processing." arXiv preprint arXiv:2308.07926 (2023).
[3] Yang, Shuai, et al. "Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation." (2023).
[4] Liang, Feng, et al. "FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis." arXiv preprint arXiv:2312.17681 (2023).