Paper Title

Masked Vision and Language Modeling for Multi-modal Representation Learning

Paper Authors

Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, Stefano Soatto

Paper Abstract

In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: the image and the text convey almost the same information, but in different formats. Reconstructing the masked signal of one modality conditioned on the other modality also implicitly learns the cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method, along with common V+L alignment losses, achieves state-of-the-art performance in the regime of millions of pre-training examples. Moreover, we outperform other competitors by a significant margin in limited-data scenarios.
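To make the joint objective concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: the module names, the simple cross-attention reconstruction heads, the feature dimensions, and the masking ratios are all illustrative assumptions. It shows the core idea of the abstract: masked image patches are reconstructed by attending to text features, masked text tokens by attending to image features, and the two reconstruction losses are summed.

```python
# Illustrative sketch of joint masked vision-and-language modeling (assumptions noted above).
# Inputs are pre-extracted patch/token features; losses are L2 on patches, cross-entropy on tokens.
import torch
import torch.nn as nn

class CrossModalReconstructor(nn.Module):
    def __init__(self, dim=768, vocab_size=30522, patch_dim=768):
        super().__init__()
        # Cross-attention: queries come from the masked modality,
        # keys/values come from the other modality.
        self.img_from_txt = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.txt_from_img = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.img_head = nn.Linear(dim, patch_dim)   # regress raw patch values
        self.txt_head = nn.Linear(dim, vocab_size)  # predict masked token ids

    def forward(self, img_feats, txt_feats, img_mask, txt_mask,
                img_targets, txt_targets):
        # Reconstruct masked image patches conditioned on the text.
        img_ctx, _ = self.img_from_txt(img_feats, txt_feats, txt_feats)
        img_pred = self.img_head(img_ctx)
        mim_loss = ((img_pred - img_targets) ** 2)[img_mask].mean()

        # Reconstruct masked text tokens conditioned on the image.
        txt_ctx, _ = self.txt_from_img(txt_feats, img_feats, img_feats)
        txt_logits = self.txt_head(txt_ctx)
        mlm_loss = nn.functional.cross_entropy(
            txt_logits[txt_mask], txt_targets[txt_mask])

        # Joint objective: each modality is recovered with help from the other.
        return mim_loss + mlm_loss

# Toy usage with random features and masks.
B, Np, Nt, D = 2, 196, 32, 768
model = CrossModalReconstructor(dim=D)
loss = model(
    torch.randn(B, Np, D), torch.randn(B, Nt, D),
    torch.rand(B, Np) < 0.6, torch.rand(B, Nt) < 0.15,
    torch.randn(B, Np, D), torch.randint(0, 30522, (B, Nt)),
)
loss.backward()
```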
