Paper Title

MVP: Multimodality-guided Visual Pre-training

Paper Authors

Longhui Wei, Lingxi Xie, Wengang Zhou, Houqiang Li, Qi Tian

Paper Abstract

Recently, masked image modeling (MIM) has become a promising direction for visual pre-training. In the context of vision transformers, MIM learns effective visual representation by aligning the token-level features with a pre-defined space (e.g., BEIT used a d-VAE trained on a large image corpus as the tokenizer). In this paper, we go one step further by introducing guidance from other modalities and validating that such additional knowledge leads to impressive gains for visual pre-training. The proposed approach is named Multimodality-guided Visual Pre-training (MVP), in which we replace the tokenizer with the vision branch of CLIP, a vision-language model pre-trained on 400 million image-text pairs. We demonstrate the effectiveness of MVP by performing standard experiments, i.e., pre-training the ViT models on ImageNet and fine-tuning them on a series of downstream visual recognition tasks. In particular, pre-training ViT-Base/16 for 300 epochs, MVP reports a 52.4% mIoU on ADE20K, surpassing BEIT (the baseline and previous state-of-the-art) with an impressive margin of 6.8%.
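
The core idea described above is masked feature prediction against a frozen CLIP vision encoder: the ViT student reconstructs CLIP's patch-level features at masked positions instead of d-VAE tokens. Below is a minimal, hypothetical PyTorch-style sketch of such an objective. It is not the authors' implementation; the `student` and `clip_visual` callables, their output shapes, the random masking scheme, and the cosine-alignment loss are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def mvp_style_loss(student, clip_visual, images, mask_ratio=0.75):
    """Illustrative masked feature-prediction loss with frozen CLIP targets.

    Assumptions (hypothetical interfaces, not from the paper's code):
      - clip_visual(images) returns per-patch features of shape (B, N, D)
        from a frozen CLIP vision branch.
      - student(images, mask) is a ViT that replaces masked patches with a
        mask token and predicts features of shape (B, N, D).
    """
    with torch.no_grad():
        target = clip_visual(images)              # (B, N, D), frozen targets
    B, N, _ = target.shape

    # Randomly mark a subset of patches as masked for the student network.
    mask = torch.rand(B, N, device=images.device) < mask_ratio

    pred = student(images, mask)                  # (B, N, D) predictions

    # Align predictions with CLIP features at masked positions (cosine loss).
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    loss = (1.0 - (pred * target).sum(dim=-1))[mask].mean()
    return loss
```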
