Paper Title


InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

Paper Authors

Junyang Lin, An Yang, Yichang Zhang, Jie Liu, Jingren Zhou, Hongxia Yang

Abstract

Multi-modal pretraining for learning high-level multi-modal representations is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, namely InterBERT (BERT for Interaction), which is the first model in our series of multi-modal pretraining methods, M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model has a strong capability for modeling interactions between the information flows of different modalities. The single-stream interaction module effectively processes information from multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance degradation on single-modal tasks. We pretrain the model with three pretraining tasks, including masked segment modeling (MSM), masked region modeling (MRM), and image-text matching (ITM), and finetune the model on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods; the analysis shows that MSM and MRM are effective for pretraining, and that our method can achieve performance comparable to BERT on single-modal tasks. In addition, we propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT, the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from Mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and we recently deployed the model online for topic-based recommendation.
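Of the pretraining tasks named above, masked segment modeling (MSM) masks a contiguous span of text tokens rather than isolated ones, forcing the model to reconstruct the span from the surrounding text and the image. A minimal sketch of segment masking, with a hypothetical segment-length sampler (the paper's exact sampling scheme may differ):

```python
import random

MASK = "[MASK]"

def mask_segment(tokens, max_len=3, rng=None):
    """Mask one contiguous segment of tokens, in the spirit of MSM.

    Segment-length and position sampling here are illustrative
    assumptions, not the paper's exact procedure. Returns the
    masked token list and a label list that holds the original
    token at each masked position (None elsewhere).
    """
    rng = rng or random.Random(0)
    seg_len = rng.randint(1, min(max_len, len(tokens)))
    start = rng.randint(0, len(tokens) - seg_len)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i in range(start, start + seg_len):
        labels[i] = masked[i]   # remember the target for the loss
        masked[i] = MASK        # hide the token from the encoder
    return masked, labels
```

The labels at masked positions would serve as reconstruction targets for the pretraining loss; everywhere else the loss is ignored.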
