(1) Feng Liang, The University of Texas at Austin and Work partially done during an internship at Meta GenAI (Email: [email protected]);
(2) Bichen Wu, Meta GenAI and Corresponding author;
(3) Jialiang Wang, Meta GenAI;
(4) Licheng Yu, Meta GenAI;
(5) Kunpeng Li, Meta GenAI;
(6) Yinan Zhao, Meta GenAI;
(7) Ishan Misra, Meta GenAI;
(8) Jia-Bin Huang, Meta GenAI;
(9) Peizhao Zhang, Meta GenAI (Email: [email protected]);
(10) Peter Vajda, Meta GenAI (Email: [email protected]);
(11) Diana Marculescu, The University of Texas at Austin (Email: [email protected]).
Table of Links
- Abstract and Introduction
- 2. Related Work
- 3. Preliminary
- 4. FlowVid
- 4.1. Inflating image U-Net to accommodate video
- 4.2. Training with joint spatial-temporal conditions
- 4.3. Generation: edit the first frame then propagate
-
- Experiments
- 5.1. Settings
- 5.2. Qualitative results
- 5.3. Quantitative results
- 5.4. Ablation study and 5.5. Limitations
- Conclusion, Acknowledgments and References
- A. Webpage Demo and B. Quantitative comparisons
5.4. Ablation study
Condition combinations We study the four types of conditions in Figure 6(a): (I) Spatial controls: such as depth maps [38]. (II) Flow warped video: frames warped from the first frame using optical flow. (III) Flow occlusion: masks indicate which parts are occluded (marked as white). (IV) First frame. We evaluate combinations of these conditions in Figure 6(b), assessing their effectiveness by their winning rate against our full model which contains all four conditions. The spatial-only condition achieved a 9% winning rate, limited by its lack of temporal information. Including flow warped video significantly improved the winning rate to 38%, underscoring the importance of temporal guidance. We use gray pixels to indicate occluded areas, which might blend in with the original gray colors in the images. To avoid potential confusion, we further include a binary flow occlusion mask, which better helps the model to tell which part is occluded or not. The winning rate is further improved to 42%. Finally, we added the first frame condition to provide better texture guidance, particularly useful when the occlusion mask is large and few original pixels remain.
Different control type: edge and depth We study two types of spatial conditions in our FlowVid: canny edge [5] and depth map [38]. Given an input frame as shown in Figure 7(a), the canny edge retains more details than the depth map, as seen from the eyes and mouth of the panda. The strength of spatial control would, in turn, affect the video editing. For style transfer prompt ’A Chinese ink painting of a panda eating bamboo’, as shown in Figure 7(c), the output of canny condition could keep the mouth of the panda in the right position while the depth condition would guess where the mouth is and result in an open mouth. The flexibility of the depth map, however, would be beneficial if we are doing object swap with prompt ’A koala eating bamboo’, as shown in Figure 7(d); the canny edge would put a pair of panda eyes on the face of the koala due to the strong control, while depth map would result in a better koala edit. During our evaluation, we found canny edge works better when we want to keep the structure of the input video as much as possible, such as stylization. The depth map works better if we have a larger scene change, such as an object swap, which requires more considerable editing flexibility.
v**-prediction and** ϵ**-prediction** While ϵ-prediction is commonly used for parameterization in diffusion models, we found it may suffer from unnatural global color shifts across frames, as shown in Figure 8. Even though all these two methods use the same flow warped video, the ϵ-prediction introduces an unnatural grayer color. This phenomenon is also found in Imagen-Video [17].
5.5. Limitations
Although our FlowVid achieves significant performance, it does have some limitations. First, our FlowVid heavily relies on the first frame generation, which should be structurally aligned with the input frame. As shown in Figure 9(a), the edited first frame identifies the hind legs of the elephant as the front nose. The erroneous nose would propagate to the following frame and result in an unsatisfactory final prediction. The other challenge is when the camera or the object moves so fast that large occlusions occur. In this case, our model would guess, sometimes hallucinate, the missing blank regions. As shown in Figure 9(b), when the ballerina turns her body and head, the entire body part is masked out. Our model manages to handle the clothes but turns the back of the head into the front face, which would be confusing if displayed in a video.
This paper is available on arxiv under CC 4.0 license.