Paper Title

Phenaki: Variable Length Video Generation From Open Domain Textual Description

Authors

Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, Dumitru Erhan

Abstract

We present Phenaki, a model capable of realistic video synthesis given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, the limited quantity of high-quality text-video data, and the variable length of videos. To address these issues, we introduce a new model for learning video representations which compresses the video to a small representation of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text, we use a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to previous video generation methods, Phenaki can generate arbitrarily long videos conditioned on a sequence of prompts (i.e., time-variable text, or a story) in an open domain. To the best of our knowledge, this is the first time a paper has studied generating videos from time-variable prompts. In addition, compared to per-frame baselines, the proposed video encoder-decoder computes fewer tokens per video but results in better spatio-temporal consistency.
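
To make the abstract's two-stage design concrete, the sketch below illustrates a MaskGIT-style iterative parallel decoding loop: starting from a fully masked grid of discrete video tokens, a bidirectional transformer conditioned on pre-computed text embeddings repeatedly predicts all positions, keeps the most confident ones, and re-masks the rest. This is a minimal illustration, not the authors' implementation; the module names, dimensions, codebook size, and cosine re-masking schedule are assumptions chosen for readability, and the causal-in-time video tokenizer that produces and later decodes these tokens is omitted.

```python
# Minimal sketch of masked parallel decoding for video tokens conditioned on
# text tokens. Hypothetical names and shapes; not the Phenaki reference code.
import math
import torch
import torch.nn as nn

VOCAB = 8192             # size of the video-token codebook (illustrative)
MASK_ID = VOCAB          # extra id reserved for the [MASK] token
SEQ_LEN = 11 * 16 * 16   # e.g. 11 temporal x 16x16 spatial latent tokens
DIM = 512

class BidirectionalTokenTransformer(nn.Module):
    """Bidirectional (non-causal) transformer over video tokens,
    with projected text embeddings prepended as conditioning context."""
    def __init__(self, dim=DIM, depth=4, heads=8, text_dim=DIM):
        super().__init__()
        self.video_emb = nn.Embedding(VOCAB + 1, dim)    # +1 for [MASK]
        self.pos_emb = nn.Parameter(torch.zeros(1, SEQ_LEN, dim))
        self.text_proj = nn.Linear(text_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.to_logits = nn.Linear(dim, VOCAB)

    def forward(self, video_ids, text_embs):
        x = self.video_emb(video_ids) + self.pos_emb
        ctx = self.text_proj(text_embs)                  # (B, T_text, dim)
        h = self.encoder(torch.cat([ctx, x], dim=1))     # full bidirectional attention
        return self.to_logits(h[:, ctx.shape[1]:])       # logits for video positions only

@torch.no_grad()
def sample_video_tokens(model, text_embs, steps=12):
    """Iterative parallel decoding: start fully masked; at each step commit the
    most confident predictions and re-mask the rest (cosine schedule)."""
    B = text_embs.shape[0]
    ids = torch.full((B, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(ids, text_embs)
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)
        # already-committed tokens keep confidence 1 so they are not re-masked
        conf = torch.where(ids == MASK_ID, conf, torch.ones_like(conf))
        # fraction of positions that stays masked after this step
        mask_frac = math.cos(math.pi / 2 * (step + 1) / steps)
        num_masked = int(mask_frac * SEQ_LEN)
        ids = torch.where(ids == MASK_ID, pred, ids)     # commit current predictions
        if num_masked > 0:
            lowest = conf.topk(num_masked, largest=False).indices
            ids.scatter_(1, lowest, MASK_ID)             # re-mask least confident positions
    return ids  # discrete video tokens, to be de-tokenized into frames

# usage (random text embeddings stand in for real pre-computed text tokens):
# tokens = sample_video_tokens(BidirectionalTokenTransformer(), torch.randn(1, 20, DIM))
```

The point of the sketch is that every video-token position is predicted in parallel at each step, rather than one token at a time as in autoregressive decoding, which is what makes generation over long token grids tractable.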
