Paper Title

OmniMAE: Single Model Masked Pretraining on Images and Videos

Authors

Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

Abstract

Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work studies these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training of huge model architectures. In particular, we show that our single ViT-Huge model can be finetuned to achieve 86.6% on ImageNet and 75.5% on the challenging Something Something-v2 video benchmark, setting a new state-of-the-art.
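
For intuition on the masking ratios mentioned in the abstract (drop 90% of image patches and 95% of video patches, then train the encoder only on what remains), here is a minimal NumPy sketch of random patch dropping. The helper name `random_patch_mask`, the 256-token image example, and the 1568-token video example are illustrative assumptions, not the paper's actual code or configuration.

```python
import numpy as np

def random_patch_mask(num_patches: int, mask_ratio: float, rng: np.random.Generator):
    """Split patch indices into a visible set and a masked set.

    In masked autoencoding, only the visible patches are fed to the
    encoder; the decoder is trained to reconstruct the masked ones.
    """
    num_keep = int(num_patches * (1.0 - mask_ratio))
    perm = rng.permutation(num_patches)
    return perm[:num_keep], perm[num_keep:]

rng = np.random.default_rng(seed=0)

# Image example: a 224x224 image split into 14x14 patches -> 16*16 = 256 tokens,
# masked at the 90% ratio stated in the abstract.
visible_img, masked_img = random_patch_mask(256, 0.90, rng)
print(len(visible_img), "visible image patches")   # 25

# Video example: an illustrative count of 1568 spatio-temporal tokens,
# masked at the 95% ratio stated in the abstract.
visible_vid, masked_vid = random_patch_mask(1568, 0.95, rng)
print(len(visible_vid), "visible video patches")   # 78
```

With ratios this high, the encoder processes only a small fraction of the tokens per sample, which is what makes pretraining huge architectures comparatively fast.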
