Paper Title

Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Paper Authors

Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, Desmond Elliott

Paper Abstract

Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorised into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five V&L BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.
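The unification of single-stream and dual-stream encoders mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's exact formulation, only an assumed simplification: dual-stream cross-modal attention (text queries attending to image keys/values, and vice versa) behaves like single-stream self-attention over the concatenated sequence with a mask that blocks intra-modal positions.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (single head, no projections)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def masked_attention(Q, K, V, mask):
    """Same attention, but positions where mask is False are blocked."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d = 8
txt = rng.standard_normal((4, d))   # 4 text-token embeddings (toy data)
img = rng.standard_normal((3, d))   # 3 image-region embeddings (toy data)

# Single-stream: self-attention over the concatenated sequence;
# every position may attend to every other, across modalities.
x = np.concatenate([txt, img])
single = attention(x, x, x)

# Dual-stream cross-attention expressed as masked single-stream attention:
# the mask allows only text -> image and image -> text positions.
n_t, n_i = len(txt), len(img)
mask = np.zeros((n_t + n_i, n_t + n_i), dtype=bool)
mask[:n_t, n_t:] = True   # text queries see image keys
mask[n_t:, :n_t] = True   # image queries see text keys
dual_as_masked_single = masked_attention(x, x, x, mask)

# The masked single-stream result matches explicit cross-attention.
cross_txt = attention(txt, img, img)
assert np.allclose(dual_as_masked_single[:n_t], cross_txt)
```

Under this view, the two architecture families differ only in which attention patterns (and parameter sharing) they permit, which is the kind of design axis the controlled experiments in the paper vary.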
