Imagen Video: High Definition Video Generation with Diffusion Models

Paper title

Imagen Video: High Definition Video Generation with Diffusion Models

Authors

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, Tim Salimans

Abstract

We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high definition text-to-video model including design decisions such as the choice of fully-convolutional temporal and spatial super-resolution models at certain resolutions, and the choice of the v-parameterization of diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video generation setting. Finally, we apply progressive distillation to our video models with classifier-free guidance for fast, high quality sampling. We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding. See https://imagen.research.google/video/ for samples.
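The abstract names two techniques without defining them: the v-parameterization and classifier-free guidance. The sketch below illustrates both on plain Python lists. It is a minimal illustration under stated assumptions, not the authors' implementation: the cosine noise schedule, the function names, and the guidance weight `w` are hypothetical choices made for the example, and the guidance formula is the form commonly given in the classifier-free guidance literature.

```python
import math


def alpha_sigma(t):
    """Assumed cosine schedule with alpha_t^2 + sigma_t^2 = 1 for t in [0, 1]."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def noisy_latent(x, eps, t):
    """Forward process: z_t = alpha_t * x + sigma_t * eps."""
    a, s = alpha_sigma(t)
    return [a * xi + s * ei for xi, ei in zip(x, eps)]


def v_target(x, eps, t):
    """v-parameterization target: v_t = alpha_t * eps - sigma_t * x."""
    a, s = alpha_sigma(t)
    return [a * ei - s * xi for xi, ei in zip(x, eps)]


def x_from_v(z, v, t):
    """Recover the clean sample from a v prediction: x = alpha_t * z_t - sigma_t * v_t."""
    a, s = alpha_sigma(t)
    return [a * zi - s * vi for zi, vi in zip(z, v)]


def guided_prediction(cond, uncond, w):
    """Classifier-free guidance: (1 + w) * conditional - w * unconditional."""
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]
```

With a schedule that satisfies alpha_t^2 + sigma_t^2 = 1, `x_from_v(noisy_latent(x, eps, t), v_target(x, eps, t), t)` returns `x` up to floating-point error. Setting `w = 0` in `guided_prediction` returns the conditional prediction unchanged, and larger `w` pushes the output further away from the unconditional prediction.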
