Fly-A-Video:
Frequency-Level Frame-Consistency Diffusion Models for General Video Editing

Chen Chen1 Mengjie Bian1 Bin Song1 Xinbo Gao2

1 Xidian University 2 Chongqing University of Posts and Telecommunications

Abstract

Recently, text- and image-driven techniques based on diffusion models have achieved notable success in video editing. However, challenges persist, particularly in complex dynamic scenes with substantial pixel variation between consecutive frames. Existing video editing methods struggle with frame inconsistency, leading to loss of key content and color disruptions during motion.

In our research, we present Fly-A-Video, a pioneering general-purpose video editing framework designed to address these challenges. We leverage a frequency-level frame-consistency diffusion model to generate high-quality, coherent videos. Our approach involves two key innovations:

  1. Frequency Optimization Module: We incorporate a discrete cosine transform (DCT) into the diffusion model to build a frequency optimization module. This module decomposes the temporal features of a video into distinct frequency components: the low-frequency components encapsulate the core content, while the high-frequency components capture differences between adjacent frames. By reinforcing the low-frequency components and attenuating the high-frequency ones, we enhance frame consistency in the generated video sequence (see the first sketch after this list).
  2. Frequency-Level Attention Mechanism: We further introduce a straightforward frequency-level attention mechanism that computes correlations between the frequency-domain features of the current frame and those of all frames in the video, further improving frame consistency during editing (see the second sketch after this list).
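
To make the frequency optimization idea concrete, here is a minimal Python sketch. It assumes the temporal features form a NumPy array of shape (T, C) — T frames, C channels per frame; the cutoff index and the boost/damping weights are illustrative placeholders, not values taken from the paper.

    import numpy as np
    from scipy.fft import dct, idct

    def frequency_optimize(feats, cutoff=4, low_boost=1.2, high_damp=0.6):
        """Reweight temporal DCT components: amplify low frequencies
        (shared content), attenuate high frequencies (frame-to-frame
        differences). The weights here are hypothetical examples."""
        # Type-II DCT along the temporal axis (axis 0), orthonormalized.
        coeffs = dct(feats, type=2, norm="ortho", axis=0)
        # Boost the first `cutoff` (low-frequency) components, damp the rest.
        weights = np.where(np.arange(feats.shape[0]) < cutoff,
                           low_boost, high_damp)
        coeffs *= weights[:, None]
        # Inverse DCT back to the temporal domain.
        return idct(coeffs, type=2, norm="ortho", axis=0)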

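The frequency-level attention mechanism can be sketched in the same spirit. The block below assumes per-frame features of shape (T, D) and uses scaled dot-product similarity with a softmax over frames; the transform axis and the similarity measure are our assumptions, not the paper's exact design.

    import numpy as np
    from scipy.fft import dct

    def frequency_attention(feats, cur):
        """Attend from frame `cur` to all T frames in the DCT domain."""
        # Per-frame frequency features (DCT along the feature axis).
        freq = dct(feats, type=2, norm="ortho", axis=-1)
        q = freq[cur]                                # query: current frame
        # Scaled dot-product similarity of the query against every frame.
        scores = freq @ q / np.sqrt(freq.shape[-1])
        attn = np.exp(scores - scores.max())         # softmax over frames
        attn /= attn.sum()
        return attn @ feats                          # consistency-weighted mix
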
We conducted a thorough evaluation of our approach across diverse video datasets, demonstrating superior visual coherence, diversity, and textual-content consistency compared to existing methods.




Introduction





Method


Fly-A-Video: Given an input video paired with a text or image prompt (e.g., "an SUV driving in the snow"), our method leverages frequency-level frame-consistency diffusion models for general editing. We mainly modify the attention layers and introduce the frequency optimization module.




Results


Text-driven Video Editing


Image-driven Video Editing


General Video Editing


General Video Editing with Style



Qualitative Comparison with Other Methods


Text-driven Video Editing


General Video Editing



Ablation Study


W/o Frequency Optimization Module


W/o Attention Layers


Comparison of Different C Values