Text-Driven Image & Video Synthesis: A Closer Look

17 Dec 2024

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Method

4.1 Key Sample and Joint Editing

4.2 Edit Propagation Via TokenFlow

5 Results

5.1 Qualitative Evaluation and 5.2 Quantitative Evaluation

5.3 Ablation Study

6 Discussion

7 Acknowledgement and References

A Implementation Details

Text-driven image & video synthesis. Seminal works designed GAN architectures to synthesize images conditioned on text embeddings (Reed et al., 2016; Zhang et al., 2016). With the ever-growing scale of vision-language datasets and pretraining strategies (Radford et al., 2021; Schuhmann et al., 2022), there has been remarkable progress in text-driven image generation capabilities. Users can synthesize high-quality visual content using simple text prompts. Much of this progress is also attributed to diffusion models (Sohl-Dickstein et al., 2015; Croitoru et al., 2022; Dhariwal & Nichol, 2021; Ho et al., 2020; Nichol & Dhariwal, 2021), which have been established as state-of-the-art text-to-image generators (Nichol et al., 2021; Saharia et al., 2022; Ramesh et al., 2022; Rombach et al., 2022; Sheynin et al., 2022; Bar-Tal et al., 2023). Such models have been extended to text-to-video generation by extending 2D architectures to the temporal dimension (e.g., using temporal attention; Ho et al., 2022b) and performing large-scale training on video datasets (Ho et al., 2022a; Blattmann et al., 2023; Singer et al., 2022). Recently, Gen-1 (Esser et al., 2023) tailored a diffusion model architecture for the task of video editing by conditioning the network on structure/appearance representations. Nevertheless, due to their extensive computation and memory requirements, existing video diffusion models are still in their infancy and are largely restricted to short clips, or exhibit lower visual quality compared to image models. On the other side of the spectrum, a promising recent line of work leverages a pre-trained image diffusion model for video synthesis tasks, without additional training (Fridman et al., 2023; Wu et al., 2022; Lee et al., 2023a; Qi et al., 2023). Our work falls into this category, employing a pre-trained text-to-image diffusion model for the task of video editing, without any training or finetuning.

Consistent video stylization. A common approach for video stylization involves applying image editing techniques (e.g., style transfer) on a frame-by-frame basis, followed by a post-processing stage to address temporal inconsistencies in the edited video (Lai et al., 2018b; Lei et al., 2020; 2023). Although these methods effectively reduce high-frequency temporal flickering, they are not designed to handle frames that exhibit substantial variations in content, which often occur when applying text-based image editing techniques (Qi et al., 2023). Kasten et al. (2021) propose to decompose a video into a set of 2D atlases, each providing a unified representation of the background or of a foreground object throughout the video. Edits applied to the 2D atlases are automatically mapped back to the video, thus achieving temporal consistency with minimal effort. Bar-Tal et al. (2022) and Lee et al. (2023b) leverage this representation to perform text-driven editing. However, the atlas representation is limited to videos with simple motion and requires long training, limiting the applicability of this technique and of the methods built upon it. Our work is also related to classical works demonstrating that small patches in a natural video extensively repeat across frames (Shahar et al., 2011; Cheung et al., 2005), and thus consistent editing can be simplified by editing a subset of keyframes and propagating the edit across the video, either by establishing patch correspondences using handcrafted features and optical flow (Ruder et al., 2016; Jamriška et al., 2019) or by training a patch-based GAN (Texler et al., 2020). Nevertheless, such propagation methods struggle to handle videos with illumination changes or with complex dynamics. Importantly, they rely on a user-provided consistent edit of the keyframes, which remains a labor-intensive task yet to be automated. Yang et al. (2023) combine keyframe editing with the propagation method of Jamriška et al. (2019). They edit keyframes using a text-to-image diffusion model while enforcing optical flow constraints on the edited keyframes. However, since optical flow estimation between distant frames is not reliable, their method fails to consistently edit keyframes that are far apart (as seen in our Supplementary Material, SM), and as a result, fails to consistently edit most videos. Our work shares a similar motivation with this approach, which benefits from the temporal redundancies in natural videos. We show that such redundancies are also present in the feature space of a text-to-image diffusion model, and leverage this property to achieve consistency.

Figure 3: Diffusion features across time. Left: Given an input video (top row), we apply DDIM inversion on each frame and extract features from the highest resolution decoder layer in ϵ_θ. We apply PCA on the features (i.e., output tokens from the self-attention module) extracted from all frames and visualize the first three components (second row). We further visualize an x-t slice (marked in red on the original frame) for both RGB and features (bottom row). The feature representation is consistent across time: corresponding regions are encoded with similar features across the video. Middle: Frames and feature visualization for an edited video obtained by applying an image editing method (Tumanyan et al., 2023) on each frame; inconsistent patterns in RGB are also evident in the feature space (e.g., on the dog's body). Right: Our method constrains the edited video to exhibit the same level of feature consistency as the original video, which translates into a coherent and high-quality edit in RGB space.

Controlled generation via diffusion features manipulation. Recently, a surge of works demonstrated how text-to-image diffusion models can be readily adapted to various editing and generation tasks by performing simple operations on the intermediate feature representation of the diffusion network (Chefer et al., 2023; Hong et al., 2022; Ma et al., 2023; Tumanyan et al., 2023; Hertz et al., 2022; Patashnik et al., 2023; Cao et al., 2023). Luo et al. (2023) and Zhang et al. (2023) demonstrated semantic appearance swapping using diffusion feature correspondences. Hertz et al. (2022) observed that by manipulating the cross-attention layers, it is possible to control the relation between the spatial layout of the image and each word in the text. Plug-and-Play Diffusion (PnP; Tumanyan et al., 2023) analyzed the spatial features and the self-attention maps and found that they capture semantic information at high spatial granularity. Tune-A-Video (Wu et al., 2022) observed that by extending the self-attention module to operate on more than a single frame, it is possible to generate frames that share a common global appearance. Qi et al. (2023), Ceylan et al. (2023), Khachatryan et al. (2023a), Shin et al. (2023), and Liu et al. (2023) leverage this property to achieve globally coherent video edits. Nevertheless, as demonstrated in Sec. 5, inflating the self-attention module is insufficient for achieving fine-grained temporal consistency. Prior and concurrent works either compromise visual quality or exhibit limited temporal consistency. In this work, we also perform video editing via simple operations in the feature space of a pre-trained text-to-image model; however, we explicitly encourage the features of the model to be temporally consistent through TokenFlow.
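To make the idea of extending self-attention beyond a single frame concrete, the following is a minimal, dependency-free sketch. It is an illustration under assumptions, not the implementation of any cited method: the function names are hypothetical, the learned query/key/value projections and multi-head structure of a real diffusion network are omitted, and tokens are plain lists of floats. The only point it shows is that each frame's queries attend to keys and values concatenated from all frames.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per token


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def extended_attention(
    frame_queries: List[Matrix],
    frame_keys: List[Matrix],
    frame_values: List[Matrix],
) -> List[Matrix]:
    """Each frame's queries attend to the tokens of ALL frames (hypothetical helper)."""
    all_keys = [k for frame in frame_keys for k in frame]
    all_values = [v for frame in frame_values for v in frame]
    return [attention(q, all_keys, all_values) for q in frame_queries]


# Toy example: two frames, one token each.
queries = [[[1.0, 0.0]], [[1.0, 0.0]]]
keys = [[[1.0, 0.0]], [[0.0, 1.0]]]
values = [[[1.0, 0.0]], [[0.0, 1.0]]]
shared = extended_attention(queries, keys, values)
```

Because both frames draw from the same pool of keys and values, identical queries produce identical outputs, which is one way to see why this operation encourages a shared global appearance. It does not by itself tie individual tokens to their counterparts in other frames.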

Figure 2: Fine-grained feature correspondences. Features (i.e., output tokens from the self-attention modules) extracted from a source frame are used to reconstruct nearby frames. This is done by: (a) swapping each feature in the target by its nearest feature in the source, in all layers and all generation time steps, and (b) simple warping in RGB space, using a nearest-neighbour field (c), computed between the source and target features extracted from the highest resolution decoder layer. The target is faithfully reconstructed, demonstrating the high level of spatial granularity and shared content between the features.
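The nearest-neighbour field and the warping step described in this caption can be sketched as follows. This is a toy illustration, not the authors' code: the function names are hypothetical, cosine similarity is an assumed choice of similarity measure (the text above does not specify one), and features are short lists of floats rather than real self-attention tokens.

```python
import math
from typing import List, Sequence, TypeVar

Vector = Sequence[float]
T = TypeVar("T")


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbour_field(
    source_feats: Sequence[Vector], target_feats: Sequence[Vector]
) -> List[int]:
    """For every target token, the index of its most similar source token."""
    field = []
    for target in target_feats:
        best = max(
            range(len(source_feats)),
            key=lambda i: cosine_similarity(source_feats[i], target),
        )
        field.append(best)
    return field


def warp(source_values: Sequence[T], field: Sequence[int]) -> List[T]:
    """Rebuild the target by copying source entries along the field."""
    return [source_values[i] for i in field]


# Toy example: three tokens per frame, two feature channels.
source_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_feats = [[0.1, 0.9], [0.9, 0.8], [2.0, 0.1]]
field = nearest_neighbour_field(source_feats, target_feats)

# The same field can move any per-token quantity, e.g. colours.
source_colours = ["red", "green", "blue"]
reconstructed = warp(source_colours, field)
```

The field is computed once from features and can then be reused to move any per-token quantity, whether features (as in panel (a)) or pixel values (as in panel (b)). The brute-force search is quadratic in the number of tokens, which is acceptable for a sketch but would need a batched tensor implementation at realistic resolutions.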

Figure 4: TokenFlow pipeline. Top: Given an input video I, we DDIM-invert each frame, extract its tokens (i.e., output features from the self-attention modules) from each timestep and layer, and compute inter-frame feature correspondences using a nearest-neighbor (NN) search. Bottom: The edited video is generated as follows: at each denoising step t, (I) we sample keyframes from the noisy video J_t and jointly edit them using an extended-attention block; the set of resulting edited tokens is T_base. (II) We propagate the edited tokens across the video according to the pre-computed correspondences of the original video features. To denoise J_t, we feed each frame to the network and replace the generated tokens with the tokens obtained from the propagation step (II).
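Step (II) of the caption can be illustrated with a deliberately simplified sketch. It assumes the correspondences have already been computed from the original video and that each token of each frame points to exactly one token in one keyframe; the exact propagation rule is defined in Sec. 4.2 and is not reproduced here. The names `propagate_tokens`, `t_base`, and the `(keyframe index, token index)` encoding are hypothetical.

```python
from typing import Dict, List, Tuple

Token = List[float]
# (keyframe index, token index inside that keyframe); an assumed encoding
Correspondence = Tuple[int, int]


def propagate_tokens(
    edited_key_tokens: Dict[int, List[Token]],
    correspondences: List[List[Correspondence]],
) -> List[List[Token]]:
    """Fill every frame with edited keyframe tokens along fixed correspondences.

    edited_key_tokens: edited tokens per keyframe index (T_base in the caption).
    correspondences:   for each frame and each token position, which keyframe
                       token it matched in the ORIGINAL (unedited) video.
    """
    propagated = []
    for frame_corr in correspondences:
        frame_tokens = [
            list(edited_key_tokens[key_idx][tok_idx])  # copy, do not alias
            for key_idx, tok_idx in frame_corr
        ]
        propagated.append(frame_tokens)
    return propagated


# Toy example: a 3-frame video whose frames 0 and 2 are keyframes.
t_base = {0: [[1.0], [2.0]], 2: [[3.0], [4.0]]}
correspondences = [
    [(0, 0), (0, 1)],  # frame 0 is a keyframe and maps to itself
    [(0, 1), (2, 0)],  # frame 1 borrows from both keyframes
    [(2, 0), (2, 1)],  # frame 2 is a keyframe and maps to itself
]
video_tokens = propagate_tokens(t_base, correspondences)
```

The design point the sketch isolates is that the correspondences come from the original video and stay fixed during editing, so tokens that matched before the edit receive matching edited content afterwards.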

This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Michal Geyer, Weizmann Institute of Science (equal contribution);

(2) Omer Bar-Tal, Weizmann Institute of Science (equal contribution);

(3) Shai Bagon, Weizmann Institute of Science;

(4) Tali Dekel, Weizmann Institute of Science.