FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis: Abstract and Intro

9 Oct 2024

(1) Feng Liang, The University of Texas at Austin and Work partially done during an internship at Meta GenAI (Email: [email protected]);

(2) Bichen Wu, Meta GenAI and Corresponding author;

(3) Jialiang Wang, Meta GenAI;

(4) Licheng Yu, Meta GenAI;

(5) Kunpeng Li, Meta GenAI;

(6) Yinan Zhao, Meta GenAI;

(7) Ishan Misra, Meta GenAI;

(8) Jia-Bin Huang, Meta GenAI;

(9) Peizhao Zhang, Meta GenAI (Email: [email protected]);

(10) Peter Vajda, Meta GenAI (Email: [email protected]);

(11) Diana Marculescu, The University of Texas at Austin (Email: [email protected]).

Abstract

Diffusion models have transformed image-to-image (I2I) synthesis and are now permeating into videos. However, the advancement of video-to-video (V2V) synthesis has been hampered by the challenge of maintaining temporal consistency across video frames. This paper proposes a consistent V2V synthesis framework that jointly leverages spatial conditions and temporal optical flow clues within the source video. Contrary to prior methods that strictly adhere to optical flow, our approach harnesses its benefits while handling the imperfection in flow estimation. We encode the optical flow via warping from the first frame and use it as a supplementary reference in the diffusion model. This enables video synthesis by editing the first frame with any prevalent I2I model and then propagating the edits to successive frames. Our V2V model, FlowVid, demonstrates remarkable properties: (1) Flexibility: FlowVid works seamlessly with existing I2I models, facilitating various modifications, including stylization, object swaps, and local edits. (2) Efficiency: Generation of a 4-second video at 30 FPS and 512×512 resolution takes only 1.5 minutes, which is 3.1×, 7.2×, and 10.5× faster than CoDeF, Rerender, and TokenFlow, respectively. (3) High quality: In user studies, FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender (10.2%), and TokenFlow (40.4%).

1. Introduction

Text-guided Video-to-video (V2V) synthesis, which aims to modify the input video according to given text prompts, has wide applications in various domains, such as short-video creation and more broadly in the film industry. Notable advancements have been seen in text-guided Image-to-Image (I2I) synthesis [4, 14, 31, 43], greatly supported by large pretrained text-to-image diffusion models [37, 39, 40]. However, V2V synthesis remains a formidable task. In contrast to still images, videos encompass an added temporal dimension. Due to the ambiguity of text, there are countless ways to edit frames so they align with the target prompt. Consequently, naively applying I2I models on videos often produces unsatisfactory pixel flickering between frames.

To improve frame consistency, pioneering studies edit multiple frames jointly by inflating the image model with spatial-temporal attention [6, 25, 35, 46]. While these methods offer improvements, they do not fully attain the sought-after temporal consistency. This is because the motion within videos is merely retained in an implicit manner within the attention module. Furthermore, a growing body of research employs explicit optical flow guidance from videos. Specifically, flow is used to derive pixel correspondence, resulting in a pixel-wise mapping between two frames. The correspondence is later utilized to obtain occlusion masks for inpainting [19, 49] or to construct a canonical image [32]. However, these hard constraints can be problematic if flow estimation is inaccurate, which is often observed when the flow is determined through a pre-trained model [42, 47, 48].

In this paper, we propose to harness the benefits of optical flow while handling the imperfection in flow estimation. Specifically, we perform flow warping from the first frame to subsequent frames. These warped frames are expected to follow the structure of the original frames but contain some occluded regions (marked as gray), as shown in Figure 2(b). If we used flow as a hard constraint, for example by inpainting [19, 49] the occluded regions, the inaccurate estimate of the legs would persist, leading to an undesirable outcome. We therefore seek to include an additional spatial condition, such as the depth map in Figure 2(c), along with the temporal flow condition. The legs' position is correct in the spatial condition, so the joint spatial-temporal condition rectifies the imperfect optical flow, resulting in the consistent results in Figure 2(d).
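The warping step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `warp_first_frame`, the (dx, dy) flow layout, nearest-neighbor sampling, the forward-backward consistency test for occlusion, and the gray fill value are all assumptions made here for illustration.

```python
import numpy as np

def warp_first_frame(first_frame, flow_t_to_0, flow_0_to_t=None, tol=1.0, fill=0.5):
    """Backward-warp the first frame into frame t (hypothetical helper).

    first_frame: (H, W, C) float array.
    flow_t_to_0: (H, W, 2) array. For each pixel of frame t, the assumed
                 (dx, dy) offset to its matching pixel in the first frame.
    flow_0_to_t: optional (H, W, 2) forward flow, used only for an assumed
                 forward-backward consistency check.
    Returns the warped frame and a boolean occlusion mask.
    """
    h, w = first_frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]

    # Source coordinates in the first frame (nearest neighbor for brevity).
    src_x = np.rint(xs + flow_t_to_0[..., 0]).astype(int)
    src_y = np.rint(ys + flow_t_to_0[..., 1]).astype(int)

    # Pixels whose source falls outside the first frame cannot be warped.
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    cx = np.clip(src_x, 0, w - 1)
    cy = np.clip(src_y, 0, h - 1)

    warped = first_frame[cy, cx]  # fancy indexing returns a new array
    occluded = ~inside

    if flow_0_to_t is not None:
        # Going t -> 0 -> t should return to the start; a large residual
        # suggests occlusion or an unreliable flow estimate.
        round_trip = flow_t_to_0 + flow_0_to_t[cy, cx]
        occluded |= np.linalg.norm(round_trip, axis=-1) > tol

    warped[occluded] = fill  # mark unreliable regions as gray
    return warped, occluded
```

A hard-constraint method would trust every pixel outside the mask. The approach described in the text instead treats the warped frame as one of two conditions, so a spatial condition such as a depth map can override regions where the flow is wrong but not flagged.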

We build a video diffusion model upon an inflated spatial controlled I2I model. We train the model to predict the input video using spatial conditions (e.g., depth maps) and temporal conditions (flow-warped video). During generation, we employ an edit-propagate procedure: (1) Edit the first frame with prevalent I2I models. (2) Propagate the edits throughout the video using our trained model. The decoupled design allows us to adopt an autoregressive mechanism: the current batch’s last frame can be the next batch’s first frame, allowing us to generate lengthy videos.
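The edit-propagate procedure and its autoregressive extension can be sketched as plain control flow. This is only an outline under stated assumptions: `edit_first_frame` and `propagate_batch` are hypothetical callables standing in for an I2I model and the trained video model, and `batch_size` is a free parameter, not a value taken from the paper.

```python
def autoregressive_batches(num_frames, batch_size):
    """Split frame indices into batches that overlap by one frame, so the
    last frame of one batch is the first frame of the next."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    batches = []
    start = 0
    while True:
        end = min(start + batch_size, num_frames)
        batches.append(list(range(start, end)))
        if end >= num_frames:
            break
        start = end - 1  # reuse the last frame as the next anchor
    return batches


def edit_then_propagate(frames, edit_first_frame, propagate_batch, batch_size):
    """Edit the first frame, then propagate the edit batch by batch.

    edit_first_frame(frame) -> edited frame            (hypothetical I2I model)
    propagate_batch(source_frames, edited_anchor) -> list of edited frames,
        one per source frame, whose first element corresponds to the anchor
        (hypothetical stand-in for the trained video model)
    """
    edited = [edit_first_frame(frames[0])]
    for idx in autoregressive_batches(len(frames), batch_size):
        if len(idx) == 1:
            continue  # single-frame video: nothing to propagate
        out = propagate_batch([frames[i] for i in idx], edited[idx[0]])
        edited.extend(out[1:])  # the anchor itself is already stored
    return edited
```

For example, `autoregressive_batches(10, 4)` yields three batches covering frames 0-3, 3-6, and 6-9. Because each batch is anchored on an already edited frame, the video length is not bounded by the number of frames the model processes at once.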

We train our model with 100k real videos from ShutterStock [1], and it generalizes well to different types of modifications, such as stylization, object swaps, and local edits, as seen in Figure 1. Compared with existing V2V methods, FlowVid demonstrates significant advantages in terms of efficiency and quality. FlowVid can generate 120 frames (4 seconds at 30 FPS) at high resolution (512×512) in just 1.5 minutes on one A100 GPU, which is 3.1×, 7.2×, and 10.5× faster than the state-of-the-art methods CoDeF [32] (4.6 minutes), Rerender [49] (10.8 minutes), and TokenFlow [13] (15.8 minutes), respectively. We conducted a user study on 25 DAVIS [34] videos with 115 designed prompts. Results show that our method is more robust and achieves a preference rate of 45.7%, compared to CoDeF (3.5%), Rerender (10.2%), and TokenFlow (40.4%).
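The reported speedups follow directly from the runtimes quoted above. The short check below uses only those figures; the variable names are illustrative.

```python
# Runtimes in minutes for a 4-second, 30 FPS, 512x512 clip, as quoted in the text.
runtimes_min = {"FlowVid": 1.5, "CoDeF": 4.6, "Rerender": 10.8, "TokenFlow": 15.8}

num_frames = 4 * 30  # 4 seconds at 30 FPS

# Speedup of FlowVid over each baseline, rounded to one decimal place.
speedups = {
    name: round(minutes / runtimes_min["FlowVid"], 1)
    for name, minutes in runtimes_min.items()
    if name != "FlowVid"
}

# Average generation time per frame for FlowVid, in seconds.
seconds_per_frame = runtimes_min["FlowVid"] * 60 / num_frames

print(num_frames)         # 120
print(speedups)           # {'CoDeF': 3.1, 'Rerender': 7.2, 'TokenFlow': 10.5}
print(seconds_per_frame)  # 0.75
```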

Our contributions are summarized as follows: (1) We introduce FlowVid, a V2V synthesis method that harnesses the benefits of optical flow while delicately handling the imperfection in flow estimation. (2) Our decoupled edit-propagate design supports multiple applications, including stylization, object swaps, and local editing. Furthermore, it empowers us to generate lengthy videos via autoregressive evaluation. (3) Large-scale human evaluation indicates the efficiency and high generation quality of FlowVid.

This paper is available on arXiv under CC 4.0 license.