Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Paper title

Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Authors

Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, Sean Bell

Abstract

Generating a video from the first few static frames is challenging because the model must anticipate plausible future frames while maintaining temporal coherence. Besides video prediction, the ability to rewind from the last frame or to infill between the head and tail frames is also crucial, but both have rarely been explored for video completion. Since just a few frames can hint at many different outcomes, a system that follows natural language to perform video completion may significantly improve controllability. Inspired by this, we introduce a novel task, text-guided video completion (TVC), which asks the model to generate a video from partial frames guided by an instruction. We then propose Multimodal Masked Video Generation (MMVG) to address the TVC task. During training, MMVG discretizes the video frames into visual tokens and masks most of them to perform video completion from any time point. At inference time, a single MMVG model can address all three cases of TVC (video prediction, rewind, and infilling) by applying the corresponding masking conditions. We evaluate MMVG in various video scenarios, including egocentric, animation, and gaming. Extensive experimental results indicate that MMVG is effective in generating high-quality visual appearances with text guidance for TVC.
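
To illustrate how one model can cover all three TVC cases only by changing which frames are masked, here is a minimal frame-level sketch in Python. It is not the authors' implementation. The abstract says MMVG masks visual tokens but gives no details of the masking scheme, so the function name `tvc_frame_mask`, the `num_given` parameter, and the choice to mask whole frames are all assumptions made for illustration.

```python
from typing import List


def tvc_frame_mask(num_frames: int, task: str, num_given: int = 1) -> List[bool]:
    """Return one flag per frame: True = masked (to be generated), False = given.

    Hypothetical helper; the task names mirror the three TVC cases named
    in the abstract (prediction, rewind, infilling).
    """
    if not 0 < num_given < num_frames:
        raise ValueError("num_given must be positive and smaller than num_frames")

    head = set(range(num_given))                            # first frames
    tail = set(range(num_frames - num_given, num_frames))   # last frames

    if task == "prediction":    # head frames given, generate the future
        given = head
    elif task == "rewind":      # tail frames given, generate the past
        given = tail
    elif task == "infilling":   # head and tail given, generate the middle
        given = head | tail
    else:
        raise ValueError(f"unknown task: {task!r}")

    if len(given) >= num_frames:
        raise ValueError("no frames left to generate")

    return [i not in given for i in range(num_frames)]


if __name__ == "__main__":
    for name in ("prediction", "rewind", "infilling"):
        print(name, tvc_frame_mask(5, name))
```

In an actual token-based model, each frame-level flag would presumably be expanded to cover all of that frame's visual tokens, with the text instruction supplied as additional unmasked conditioning.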
