Looking Backward: Streaming Video-to-Video Translation with Feature Banks

Supplementary Material

 


All videos are compressed. We recommend watching them in full screen; click on a video to view it at full scale.

 


Comparison with state-of-the-art video-to-video methods

We compare our StreamV2V to StreamDiffusion [1], CoDeF [2], Rerender [3], and FlowVid [4].

(Fig.5) Edit prompt: "A pixel art of a man doing a handstand on the street." Videos, left to right and top to bottom: input video, ours, StreamDiffusion [1], CoDeF [2], Rerender [3], FlowVid [4].

 


Ablations

(Fig.8) Ablation on extended self-attention (EA) and feature fusion (FF); a sketch of what these two components do follows this list. Edit prompt: "A man is surfing, in animation".
(Fig.13) Ablation on different denoising steps. Using fewer denoising steps accelerates per-frame inference, but we observe a noticeable quality drop when using only one step.
Edit prompts: "Elon Musk" and "Claymation".
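For readers who want a concrete picture of the two ablated components, below is a minimal PyTorch sketch written from the descriptions in this supplement. The single-head attention, the cosine-similarity threshold, the blend weight, and the function names are all illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def extended_self_attention(q, k, v, bank_k=None, bank_v=None):
    """Self-attention in which the current frame's queries also attend to
    keys/values cached from past frames (the feature bank). Single-head,
    (batch, tokens, channels) tensors; head splitting is omitted for brevity."""
    if bank_k is not None:
        k = torch.cat([k, bank_k], dim=1)   # current + banked keys
        v = torch.cat([v, bank_v], dim=1)   # current + banked values
    return F.scaled_dot_product_attention(q, k, v)

def feature_fusion(cur_feat, bank_feat, threshold=0.9, weight=0.5):
    """For each current-frame token, find its most similar banked feature and,
    if the cosine similarity passes a threshold, blend the two
    (one plausible fusion rule; threshold and weight are illustrative)."""
    sim = F.normalize(cur_feat, dim=-1) @ F.normalize(bank_feat, dim=-1).transpose(-1, -2)
    score, idx = sim.max(dim=-1)            # best banked match per current token
    matched = torch.gather(bank_feat, 1,
                           idx.unsqueeze(-1).expand(-1, -1, bank_feat.size(-1)))
    blend = (score > threshold).float().unsqueeze(-1) * weight
    return blend * matched + (1 - blend) * cur_feat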

 


Continuous image generation with feature bank

(Fig.9) Left: the per-frame LCM baseline flickers severely even under slight one- or two-word prompt modifications. Right: the feature bank provides a much smoother transition.
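To make the contrast with the per-frame baseline concrete, here is a purely schematic streaming loop. The per-frame baseline corresponds to calling the generator with an empty bank at every frame, whereas the bank-based variant lets each frame attend to features cached from earlier frames. generate_frame is a hypothetical placeholder for a one- or few-step diffusion call, not an actual API.

import torch

def translate_stream(frames, prompts, generate_frame):
    """Schematic streaming loop: each frame is generated with access to
    features banked from earlier frames, rather than independently."""
    bank, outputs = [], []
    for frame, prompt in zip(frames, prompts):        # prompt may change by a word or two
        banked = torch.cat(bank, dim=0) if bank else None
        out, feats = generate_frame(frame, prompt, banked)
        bank.append(feats)                            # newest features join the bank
        outputs.append(out)
    return outputs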

 


Long video (>1000 frames) translation

(Fig.15) Long video (>1000 frames) generation. Our StreamV2V can handle videos of arbitrary length without consistency degradation.
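Arbitrary-length streaming is only practical if the feature bank does not grow with the number of processed frames. The sketch below shows one way to hold it at a fixed capacity by merging the most mutually similar entries; the capacity value, the pairwise-merging rule, and the helper name are illustrative assumptions rather than StreamV2V's exact update procedure.

import torch
import torch.nn.functional as F

def update_bank(bank, new_feats, capacity=4096):
    """Append the newest frame's features, then shrink back to `capacity` by
    averaging the most mutually similar pair of entries, so memory stays
    constant no matter how many frames have been processed."""
    bank = torch.cat([bank, new_feats], dim=0)        # (N + n_new, C)
    while bank.size(0) > capacity:
        normed = F.normalize(bank, dim=-1)
        sim = normed @ normed.t()                     # pairwise cosine similarity
        sim.fill_diagonal_(-1.0)                      # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.size(1)) # most redundant pair
        merged = 0.5 * (bank[i] + bank[j])
        keep = torch.ones(bank.size(0), dtype=torch.bool)
        keep[j] = False
        bank = bank[keep]                             # drop one of the pair...
        bank[i if i < j else i - 1] = merged          # ...and keep their average
    return bank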

 


Limitations


(Fig.10) Limitations of StreamV2V.
(a). StreamV2V fails to transform the person in the input video into the Pope or Batman.
(b). StreamV2V can produce inconsistent outputs, as seen on the girl for the anime style and on the backpack straps for the Van Gogh style.


 

 

 

[1] Kodaira, Akio, et al. "StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation." arXiv preprint arXiv:2312.12491 (2023).

[2] Ouyang, Hao, et al. "CoDeF: Content Deformation Fields for Temporally Consistent Video Processing." arXiv preprint arXiv:2308.07926 (2023).

[3] Yang, Shuai, et al. "Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation." arXiv preprint arXiv:2306.07954 (2023).

[4] Liang, Feng, et al. "FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis." arXiv preprint arXiv:2312.17681 (2023).