FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

Supplementary Material


All videos are compressed. We recommend watching them in full screen. Click on a video to view it at full scale.

Our FlowVid Results

Input video "a woman wearing headphones, in flat 2d anime" "a Greek statue wearing headphones"
Input video "a Chinese ink painting of a panda eating bamboo" "a koala eating bamboo"
Input video "A pixel art of an artist's rendering of an earth in space" "An artist's rendering of a Mars in space"
Input video "Ukiyo-e Art - a man is pulling a rope in a gym" "A gorilla is pulling a rope in a gym"
Input video "A shirtless man is doing a workout in a park, with the Egyptian pyramids visible in the distance." "Batman is doing a workout in a park"


Comparisons to State-of-the-art video-to-video methods

We compare our FlowVid to per-frame ControlNet [1], CoDeF [2], Rerender-a-Video [3], and TokenFlow [4].

For more results on DAVIS, please check: CoDeF, Rerender-a-Video, TokenFlow, and Ours.
Edit Prompt: A pirate is rowing a boat on a lake
Input video Ours Per frame ControlNet ([1])
CoDeF ([2]) Rerender ([3]) TokenFlow ([4])
Edit Prompt: a oil painting of a tiger walking
Input video Ours Per frame ControlNet ([1])
CoDeF ([2]) Rerender ([3]) TokenFlow ([4])
Edit Prompt: a woman dressed as santa claus is standing in the snow, in flat 2d anime
Input video Ours Per frame ControlNet ([1])
CoDeF ([2]) Rerender ([3]) TokenFlow ([4])


We ablate color calibration (Figure 4) and condition types (Figure 7).

Ablation of color calibration (Figure 4). As the autoregressive evaluation proceeds from the 1st batch to the 7th batch, the results without color calibration turn gray (middle). With the proposed color calibration, the results remain more stable (right). Edit prompt: "A man running on Mars".
Ablation study of different spatial conditions (Figure 7). Canny edges provide more detailed control (good for stylization), while depth maps provide more editing flexibility (good for object swaps).


We show limitations of our FlowVid (Figure 9).

Limitations of FlowVid. Failure cases include an edited first frame that does not align structurally with the original first frame (the top elephant video) and large occlusions caused by fast motion (the bottom ballerina video).


[1] Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[2] Ouyang, Hao, et al. "Codef: Content deformation fields for temporally consistent video processing." arXiv preprint arXiv:2308.07926 (2023).

[3] Yang, Shuai, Yifan Zhou, Ziwei Liu, and Chen Change Loy. "Rerender a video: Zero-shot text-guided video-to-video translation." 2023.

[4] Geyer, Michal, et al. "Tokenflow: Consistent diffusion features for consistent video editing." arXiv preprint arXiv:2307.10373 (2023).