Diffusion models have transformed image-to-image (I2I) synthesis and are now extending to video. However, progress in video-to-video (V2V) synthesis has been hampered by the challenge of maintaining temporal consistency across video frames. This paper proposes a consistent V2V synthesis framework that jointly leverages spatial conditions and temporal optical flow clues within the source video. In contrast to prior methods that strictly adhere to optical flow, our approach harnesses its benefits while handling the imperfections in flow estimation. We encode the optical flow via warping from the first frame and use it as a supplementary reference in the diffusion model. This lets our model perform video synthesis by editing the first frame with any prevalent I2I model and then propagating the edits to successive frames. Our V2V model, FlowVid, demonstrates remarkable properties: (1) Flexibility: FlowVid works seamlessly with existing I2I models, facilitating various modifications, including stylization, object swaps, and local edits. (2) Efficiency: Generating a 4-second video at 30 FPS and 512×512 resolution takes only 1.5 minutes, which is 3.1×, 7.2×, and 10.5× faster than CoDeF, Rerender, and TokenFlow, respectively. (3) High quality: In user studies, FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender (10.2%), and TokenFlow (40.4%).
Optical flow is widely used in video-to-video models. However, estimated flow can be inaccurate. We propose using spatial controls together with the flow to rectify its inaccuracies and synthesize consistent output.
We train a video diffusion model with joint spatial-temporal controls. During generation, we edit the first frame with existing I2I models, then feed the spatial controls and warped video to our trained model.
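To make the "warped video" input concrete, here is a minimal sketch of warping a first frame toward a later frame with a dense flow field. It is an illustration under stated assumptions, not FlowVid's implementation: the function name `warp_first_frame`, the backward-flow convention, the nearest-neighbor sampling, and the plain-list image format are all hypothetical choices made to keep the example dependency-free.

```python
def warp_first_frame(first_frame, flow):
    """Backward-warp `first_frame` toward a later frame (illustrative only).

    first_frame: H x W grid (list of rows) of pixel values.
    flow: H x W grid of (dx, dy) pairs. Assumed convention: the pixel at
          (x, y) in the later frame corresponds to (x + dx, y + dy) in the
          first frame.
    Returns (warped, valid), where valid[y][x] is False when the flow
    points outside the first frame, so no pixel could be fetched.
    """
    height, width = len(first_frame), len(first_frame[0])
    warped = [[0] * width for _ in range(height)]
    valid = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Nearest-neighbor lookup; real pipelines typically interpolate.
            src_x, src_y = round(x + dx), round(y + dy)
            if 0 <= src_x < width and 0 <= src_y < height:
                warped[y][x] = first_frame[src_y][src_x]
                valid[y][x] = True
    return warped, valid


# Example: every pixel in the later frame comes from one pixel to its left,
# i.e. the content moved one pixel to the right.
frame = [[1, 2, 3],
         [4, 5, 6]]
flow = [[(-1, 0)] * 3 for _ in range(2)]
warped, valid = warp_first_frame(frame, flow)
```

In this toy case the leftmost column has no source pixel and stays unfilled. Gaps like this, along with flow estimation errors, are one way a warped frame can be unreliable, which fits the role described above for spatial controls as a second source of guidance.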
Drag the slider on each video to compare: the left side is the input video and the right side is the synthesized video. Edited keywords are marked in blue. Comparisons with other methods, ablations, and limitations are shown in the supplementary videos.
Stylization Highlights.
Object Swap Highlights.
We conduct a user study on 25 DAVIS videos and 115 manually designed prompts. For results on DAVIS, please see this link.
| Method | Preference rate (mean ± std %) ↑ | Runtime (mins) ↓ | Cost ↓ |
|---|---|---|---|
| TokenFlow | 40.4 ± 5.3 | 15.8 | 10.5 × | 
| Rerender | 10.2 ± 7.1 | 10.8 | 7.2 × | 
| CoDeF | 3.5 ± 1.9 | 4.6 | 3.1 × | 
| FlowVid (Ours) | 45.7 ± 6.4 | 1.5 | 1.0 × | 
Quantitative comparison with existing V2V models. The preference rate indicates how often each method is preferred among the four methods in human evaluation. Runtime is the time to synthesize a 4-second video at 512×512 resolution on one A100-80GB GPU. Cost is the runtime normalized by that of our method.
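The Cost column can be reproduced from the Runtime column. The short sketch below divides each runtime by FlowVid's and rounds to one decimal place. The helper name `relative_cost` is hypothetical, and the frame count assumes the 30 FPS stated in the abstract.

```python
# Runtimes in minutes, copied from the table above.
runtimes_min = {
    "TokenFlow": 15.8,
    "Rerender": 10.8,
    "CoDeF": 4.6,
    "FlowVid": 1.5,
}


def relative_cost(runtimes, baseline="FlowVid"):
    """Return each runtime divided by the baseline's, rounded to 1 decimal."""
    base = runtimes[baseline]
    return {name: round(minutes / base, 1) for name, minutes in runtimes.items()}


costs = relative_cost(runtimes_min)

# A 4-second clip at 30 FPS has 120 frames, so 1.5 minutes (90 seconds)
# works out to 0.75 seconds per frame on average.
num_frames = 4 * 30
seconds_per_frame = runtimes_min["FlowVid"] * 60 / num_frames
```

The resulting ratios match the table: 10.5× for TokenFlow, 7.2× for Rerender, 3.1× for CoDeF, and 1.0× for FlowVid.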
@article{liang2023flowvid,
  title={FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis},
  author={Liang, Feng and Wu, Bichen and Wang, Jialiang and Yu, Licheng and Li, Kunpeng and Zhao, Yinan and Misra, Ishan and Huang, Jia-Bin and Zhang, Peizhao and Vajda, Peter and others},
  journal={arXiv preprint arXiv:2312.17681},
  year={2023}
}