Looking Backward: Streaming Video-to-Video Translation with Feature Banks

The University of Texas at Austin, University of California, Berkeley

Real-time V2V demo on one RTX 4090 GPU

Abstract

This paper introduces StreamV2V, a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V methods using batches to process limited frames, we opt to process frames in a streaming fashion, to support unlimited frames. At the heart of StreamV2V lies a backward-looking principle that relates the present to the past. This is realized by maintaining a feature bank, which archives information from past frames. For incoming frames, StreamV2V extends self-attention to include banked keys and values and directly fuses similar past features into the output. The feature bank is continually updated by merging stored and new features, making it compact but informative. StreamV2V stands out for its adaptability and efficiency, seamlessly integrating with image diffusion models without fine-tuning. It can run 20 FPS on one A100 GPU, being 15×, 46×, 108×, and 158× faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative metrics and user studies confirm StreamV2V's exceptional ability to maintain temporal consistency.

Performance Highlight

We present StreamV2V to support real-time video-to-video translation for streaming input. For webcam input, our StreamV2V supports face swap (e.g., to Elon Musk or Will Smith) and video stylization (e.g., to Claymation or doodle art).

Although StreamV2V is designed for the vid2vid task, it can be seamlessly applied to txt2img as well. Compared with the per-image StreamDiffusion, StreamV2V generates a continuous stream of images from text, providing much smoother transitions.

Motivation

(a) Most existing V2V methods process frames in batches. However, batch processing requires loading all frames into GPU memory, which limits the video length they can handle, typically to about 4 seconds. (b) Our StreamV2V processes frames in a streaming fashion and can handle streaming videos in real time.


Method

Overview of StreamV2V. Left: StreamV2V relates the current frame to the past by maintaining a feature bank, which stores intermediate transformer features. For incoming frames, StreamV2V fetches the stored features and uses them via Extended self-Attention (EA) and direct Feature Fusion (FF). Middle: EA concatenates the stored keys Kfb and values Vfb directly to those of the current frame in the self-attention computation. Right: Operating on the output of the transformer blocks, FF first retrieves similar features from the bank via a cosine similarity matrix, and then applies a weighted sum to fuse them.
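To make the two operations concrete, below is a minimal PyTorch sketch of Extended self-Attention. The tensor shapes, function name, and bank layout are illustrative assumptions, not the released implementation: the banked keys Kfb and values Vfb are simply concatenated to those of the current frame before standard attention.

import torch
import torch.nn.functional as F

def extended_self_attention(q, k, v, k_fb=None, v_fb=None):
    # q, k, v: (batch, heads, tokens, dim) projections of the current frame.
    # k_fb, v_fb: banked keys/values from past frames with the same layout.
    if k_fb is not None and v_fb is not None:
        # Queries stay unchanged; only keys/values are extended with the bank,
        # so the current frame can attend to archived past features.
        k = torch.cat([k, k_fb], dim=2)
        v = torch.cat([v, v_fb], dim=2)
    return F.scaled_dot_product_attention(q, k, v)

Direct Feature Fusion can be sketched in the same spirit: each token of the current transformer-block output retrieves similar banked features through a cosine-similarity matrix and blends them in with a weighted sum. The softmax temperature tau and the equal blend of current and retrieved features below are assumptions made for illustration.

def feature_fusion(x, x_fb, tau=0.1):
    # x:    (tokens, dim) transformer-block output of the current frame.
    # x_fb: (bank_tokens, dim) features archived from past frames.
    sim = F.normalize(x, dim=-1) @ F.normalize(x_fb, dim=-1).T   # cosine similarity matrix
    weights = sim.div(tau).softmax(dim=-1)                       # per-token retrieval weights
    fused_from_bank = weights @ x_fb                             # weighted sum of banked features
    return 0.5 * x + 0.5 * fused_from_bank                       # blend past into current output

In StreamV2V both operations draw on the same feature bank, which is itself refreshed by merging new features with the stored ones so that it stays compact yet informative.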

Results

We report the CLIP score and warp error in the table below to indicate the temporal consistency of the generated videos; a sketch of how these metrics can be computed follows the table.

Quantitative metrics comparison. We bold the best result and underline the second best.

Method             CLIP score ↑   Warp error ↓
StreamDiffusion    95.24          117.01
CoDeF              96.33          116.17
Rerender           96.20          107.00
TokenFlow          97.04          114.25
FlowVid            96.68          111.09
StreamV2V (ours)   96.58          102.99
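As a rough guide to how such consistency metrics are typically computed (the exact protocol in the paper may differ), the sketch below assumes the CLIP score is the average cosine similarity between CLIP embeddings of consecutive output frames, and the warp error is the mean pixel difference after warping each output frame to the next using optical flow estimated on the source video; occlusion masking and any rescaling of the reported numbers are omitted.

import torch
import torch.nn.functional as F

def clip_consistency(embeddings):
    # embeddings: (num_frames, dim) CLIP image embeddings of the output frames.
    e = F.normalize(embeddings, dim=-1)
    return (e[:-1] * e[1:]).sum(dim=-1).mean()   # mean cosine similarity of adjacent frames

def warp_error(frames, flows):
    # frames: (num_frames, 3, H, W) output frames in [0, 1].
    # flows:  (num_frames - 1, 2, H, W) forward optical flow (in pixels) from the source video.
    n, _, h, w = frames.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float()              # (2, H, W) pixel coordinates
    errors = []
    for t in range(n - 1):
        coords = base + flows[t]                             # where each pixel of frame t lands in frame t+1
        grid = torch.stack([2 * coords[0] / (w - 1) - 1,     # normalize to [-1, 1] in (x, y) order
                            2 * coords[1] / (h - 1) - 1], dim=-1).unsqueeze(0)
        warped = F.grid_sample(frames[t + 1].unsqueeze(0), grid, align_corners=True)
        errors.append((warped - frames[t].unsqueeze(0)).abs().mean())
    return torch.stack(errors).mean()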

We report our user study results and runtime breakdown in the following two figures. The results of different methods and the user study interface can be found at this link. You may also want to check the comparison with other methods, ablations, and limitations.

User study comparison. The win rate indicates how frequently our StreamV2V is preferred over each counterpart.
Runtime breakdown on one A100 GPU for generating a 4-second, 512x512-resolution, 30 FPS video.

BibTeX

@article{liang2024looking,
  title={Looking Backward: Streaming Video-to-Video Translation with Feature Banks},
  author={Liang, Feng and Kodaira, Akio and Xu, Chenfeng and Tomizuka, Masayoshi and Keutzer, Kurt and Marculescu, Diana},
  journal={arXiv preprint arXiv:2405.15757},
  year={2024}
}