Paper Title
MaskViT: Masked Visual Pre-Training for Video Prediction
Paper Authors
Paper Abstract
The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution videos (256x256). Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.
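The abstract names two mechanisms that a concrete illustration can clarify: variable-ratio masking during training and iterative, schedule-driven decoding at inference. Below is a minimal PyTorch sketch of both; the cosine schedule, the [0.5, 1.0) masking range, and the model interface are assumptions made for illustration, not the paper's exact implementation.

```python
# Minimal sketch of variable-ratio mask training and MaskGIT-style
# iterative decoding. Names, ranges, and interfaces are illustrative
# assumptions, not MaskViT's exact implementation.
import math

import torch


def sample_training_mask(n_tokens: int, lo: float = 0.5, hi: float = 1.0) -> torch.Tensor:
    """Variable-ratio masking: draw a fresh mask ratio per training example
    (uniform in [lo, hi) here, an assumed range) instead of a fixed ratio."""
    ratio = lo + (hi - lo) * torch.rand(()).item()
    n_masked = max(1, int(ratio * n_tokens))
    mask = torch.zeros(n_tokens, dtype=torch.bool)
    mask[torch.randperm(n_tokens)[:n_masked]] = True  # True = masked position
    return mask


def cosine_schedule(t: float) -> float:
    """Mask scheduling function: fraction of the initially masked tokens
    that should remain masked at decoding progress t in [0, 1]."""
    return math.cos(t * math.pi / 2)


@torch.no_grad()
def iterative_decode(model, tokens: torch.Tensor, mask_id: int, steps: int = 12) -> torch.Tensor:
    """Iterative refinement: predict all masked tokens in parallel, commit
    the most confident predictions, and re-mask the rest for the next pass.

    tokens: (B, N) long tensor of discrete video-token ids, where positions
    to be generated hold `mask_id`. `model(tokens)` is assumed to return
    logits of shape (B, N, vocab_size).
    """
    unknown = tokens == mask_id                    # (B, N) slots to fill
    n_unknown = unknown.sum(dim=-1, keepdim=True)  # (B, 1)
    for step in range(steps):
        conf, pred = model(tokens).softmax(dim=-1).max(dim=-1)  # (B, N) each
        # Conditioning frames and already-committed tokens are never revisited.
        conf = conf.masked_fill(~unknown, float("inf"))
        # How many tokens stay masked after this step, per the schedule
        # (reaches zero at the final step, so every token gets committed).
        n_mask = (cosine_schedule((step + 1) / steps) * n_unknown.float()).long()
        # Rank positions by confidence; the n_mask least confident stay masked.
        ranks = conf.argsort(dim=-1).argsort(dim=-1)
        remask = ranks < n_mask
        tokens = torch.where(unknown & ~remask, pred, tokens)
        unknown = unknown & remask
    return tokens
```

Because each decoding pass predicts many tokens in parallel, the number of forward passes is `steps` rather than one per token, which is the source of the inference speedup the abstract reports for iterative decoding.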