Paper Title
MagicVideo: Efficient Video Generation With Latent Diffusion Models
Paper Authors
Paper Abstract
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo. MagicVideo can generate smooth video clips that are consistent with the given text descriptions. Thanks to a novel and efficient 3D U-Net design and to modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips at 256x256 spatial resolution on a single GPU card, requiring around 64x fewer FLOPs than Video Diffusion Models (VDM). Specifically, unlike existing works that train video models directly in RGB space, we use a pre-trained VAE to map video clips into a low-dimensional latent space and learn the distribution of the videos' latent codes via a diffusion model. In addition, we introduce two new designs to adapt a U-Net denoiser trained on image tasks to video data: a frame-wise lightweight adaptor for image-to-video distribution adjustment, and a directed temporal attention module to capture temporal dependencies across frames. This lets us exploit the informative convolution weights of a text-to-image model to accelerate video training. To reduce pixel dithering in the generated videos, we also propose a novel VideoVAE auto-encoder for better RGB reconstruction. We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content. Refer to \url{https://magicvideo.github.io/#} for more examples.
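The directed temporal attention described in the abstract can be illustrated with a minimal sketch: each frame attends only to itself and earlier frames, so temporal information flows in one direction. The sketch below uses plain NumPy with identity query/key/value projections; the paper's actual module has learned projections, multiple heads, and operates per spatial location inside the 3D U-Net, none of which is shown here.

```python
import numpy as np

def directed_temporal_attention(x):
    """Causal (directed) self-attention over the frame axis.

    x: (T, D) array -- one D-dimensional feature vector per frame,
    e.g. for a single spatial location. A causal mask blocks each
    frame from attending to future frames.
    Illustrative sketch only, not the paper's implementation.
    """
    T, D = x.shape
    q, k, v = x, x, x                         # identity projections for the sketch
    scores = q @ k.T / np.sqrt(D)             # (T, T) attention logits
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # strictly-future positions
    scores = np.where(future, -np.inf, scores)          # mask them out
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ v                        # (T, D) attended frame features
```

Because the first frame can attend only to itself, its output equals its input under the identity projections, which is a quick sanity check on the masking direction.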