Paper Title
Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning
Paper Authors
Paper Abstract
Most methods for conditional video synthesis use a single modality as the condition. This comes with major limitations. For example, it is problematic for a model conditioned on an image to generate a specific motion trajectory desired by the user since there is no means to provide motion information. Conversely, language information can describe the desired motion, while not precisely defining the content of the video. This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately. We leverage the recent progress in quantized representations for videos and apply a bidirectional transformer with multiple modalities as inputs to predict a discrete video representation. To improve video quality and consistency, we propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens. We introduce text augmentation to improve the robustness of the textual representation and the diversity of generated videos. Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images. It can generate sequences much longer than those used for training. In addition, our model can extract visual information as suggested by the text prompt, e.g., "an object in image one is moving northeast", and generate corresponding videos. We run evaluations on three public datasets and a newly collected dataset labeled with facial attributes, achieving state-of-the-art generation results on all four.
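The abstract's core mechanism, a bidirectional transformer that fills in discrete video tokens via an iterative mask-prediction sampler, can be illustrated with a short sketch. The Python below is not the authors' released code; the function and parameter names (sample_video_tokens, mask_token_id, num_steps) and the linear unmasking schedule are assumptions chosen to show generic MaskGIT-style iterative decoding, not the paper's improved algorithm.

```python
# Hypothetical illustration only (not the paper's implementation):
# iterative mask-prediction sampling of discrete video tokens with a
# bidirectional transformer. Names and the linear schedule are assumptions.
import torch
import torch.nn.functional as F


def sample_video_tokens(transformer, cond_tokens, seq_len, vocab_size,
                        mask_token_id, num_steps=8, device="cpu"):
    """Fill a fully masked video-token sequence over several refinement steps."""
    # The mask id lies outside the codebook, so sampled tokens never collide with it.
    tokens = torch.full((1, seq_len), mask_token_id, dtype=torch.long, device=device)
    for step in range(num_steps):
        # The bidirectional transformer attends to conditioning tokens (text
        # and/or image) plus the partially filled video tokens and returns
        # logits over the video codebook for every position.
        logits = transformer(cond_tokens, tokens)             # (1, seq_len, vocab_size)
        probs = F.softmax(logits, dim=-1)
        sampled = torch.multinomial(probs.view(-1, vocab_size), 1).view(1, seq_len)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)

        still_masked = tokens == mask_token_id
        tokens = torch.where(still_masked, sampled, tokens)   # fill masked slots
        conf = conf.masked_fill(~still_masked, float("inf"))  # never re-mask fixed slots

        # Linear schedule: fewer positions remain masked after each step.
        num_remask = int((1.0 - (step + 1) / num_steps) * seq_len)
        if num_remask > 0:
            # Re-mask the least confident predictions and refine them next step.
            remask_idx = torch.topk(-conf, num_remask, dim=-1).indices
            tokens.scatter_(1, remask_idx, mask_token_id)
    return tokens


# Toy usage with a random stand-in for the transformer:
dummy_transformer = lambda cond, tok: torch.randn(1, tok.shape[1], 512)
video_tokens = sample_video_tokens(dummy_transformer, cond_tokens=None,
                                   seq_len=64, vocab_size=512, mask_token_id=512)
```

In this kind of scheme the conditioning tokens (from text, images, segmentation masks, or drawings) stay visible at every step, while only the video-token positions are iteratively unmasked, which is what allows the same sampler to serve jointly or separately provided modalities.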