Let's Take a Look at TokenFlow's Ablation Study

18 Dec 2024

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Method

4.1 Key Sample and Joint Editing

4.2 Edit Propagation Via TokenFlow

5 Results

5.1 Qualitative Evaluation and 5.2 Quantitative Evaluation

5.3 Ablation Study

6 Discussion

7 Acknowledgement and References

A Implementation Details

5.3 ABLATION STUDY

First, we ablate the use of TokenFlow (Sec. 4.2) for enforcing temporal consistency. In this experiment, we replace TokenFlow with extended attention (Eq. 3), computed between each frame of the edited video and the keyframes (w joint attention). Second, we ablate the randomization of keyframe selection at each generation step (w/o random keyframes). In this experiment, we use the same keyframe indices (evenly spaced in time) throughout the generation. Table 1 (bottom) shows the quantitative results of our ablations; the resulting videos can be found in the supplementary material (SM). As seen, TokenFlow ensures a higher degree of temporal consistency, indicating that relying solely on the extension of self-attention to multiple frames is insufficient for achieving fine-grained temporal consistency. Additionally, fixing the keyframes creates an artificial partition of the video into short clips between the fixed keyframes, which degrades the consistency of the result.
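The difference between the two keyframe strategies in the second ablation can be illustrated with a minimal sketch. The function names, the `interval` parameter, and the "one random index per chunk" sampling rule are assumptions made for illustration; the text above only states that keyframes are either randomized at each generation step or fixed and evenly spaced in time.

```python
import random


def fixed_keyframes(num_frames, interval):
    """Evenly spaced keyframe indices, identical at every generation step
    (the "w/o random keyframes" setting)."""
    return list(range(0, num_frames, interval))


def random_keyframes(num_frames, interval, rng):
    """One keyframe drawn uniformly from each chunk of `interval` frames.

    Assumption: this per-chunk sampling rule is hypothetical; it is one
    simple way to re-randomize the keyframes at every generation step.
    """
    indices = []
    for start in range(0, num_frames, interval):
        end = min(start + interval, num_frames)
        indices.append(rng.randrange(start, end))
    return indices


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    for step in range(3):
        print(step, fixed_keyframes(16, 4), random_keyframes(16, 4, rng))
```

With fixed indices, the same frames act as keyframes at every step, so the clip boundaries never move. Redrawing the indices at each step shifts those boundaries, which is consistent with the explanation above for why fixed keyframes hurt consistency.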

Table 2: We reconstruct the video using the TokenFlow pipeline, excluding keyframe editing, and evaluate the TokenFlow representation with PSNR and LPIPS metrics. Our reconstruction improves upon vanilla DDIM inversion, highlighting the robustness of the TokenFlow representation.
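For readers unfamiliar with the first metric, PSNR follows the standard definition 10 · log10(MAX² / MSE), where higher values mean a closer reconstruction. The sketch below is a plain-Python illustration of that formula, not the authors' evaluation code. The flat-list pixel representation and the `max_value` default of 255 are assumptions. LPIPS is a learned perceptual metric and is not reproduced here.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio (in dB) between two equally sized
    sequences of pixel intensities.

    Assumption: frames are flattened to flat sequences of numbers in
    the range [0, max_value].
    """
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    if len(reference) == 0:
        raise ValueError("inputs must not be empty")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For a video, one common convention is to compute PSNR per frame and average over frames; the caption does not say which aggregation was used.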

This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Michal Geyer, Weizmann Institute of Science (equal contribution);

(2) Omer Bar-Tal, Weizmann Institute of Science (equal contribution);

(3) Shai Bagon, Weizmann Institute of Science;

(4) Tali Dekel, Weizmann Institute of Science.