Paper Title
ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation
Paper Authors
Paper Abstract
Recent cross-lingual cross-modal works attempt to extend Vision-Language Pre-training (VLP) models to non-English inputs and achieve impressive performance. However, these models focus only on understanding tasks, utilizing encoder-only architectures. In this paper, we propose ERNIE-UniX2, a unified cross-lingual cross-modal pre-training framework for both generation and understanding tasks. ERNIE-UniX2 integrates multiple pre-training paradigms (e.g., contrastive learning and language modeling) based on an encoder-decoder architecture and attempts to learn better joint representations across languages and modalities. Furthermore, ERNIE-UniX2 can be seamlessly fine-tuned for a variety of downstream generation and understanding tasks. Pre-trained on both multilingual text-only and image-text datasets, ERNIE-UniX2 achieves SOTA results on various cross-lingual cross-modal generation and understanding tasks such as multimodal machine translation and multilingual visual question answering.
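The abstract describes combining contrastive learning with language modeling as joint pre-training objectives. A minimal sketch of how such objectives can be summed into one training loss is shown below; this is not the authors' implementation — the toy embeddings, InfoNCE formulation, temperature, and vocabulary size are all illustrative assumptions:

```python
import numpy as np

def info_nce(img, txt, temp=0.07):
    """Symmetric image-text contrastive loss (InfoNCE-style sketch)."""
    # L2-normalize embeddings, then compute a scaled similarity matrix;
    # matched image-text pairs lie on the diagonal.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temp
    idx = np.arange(len(img))
    # cross-entropy in both retrieval directions (image->text, text->image)
    lp_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    lp_t2i = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return (-lp_i2t[idx, idx].mean() - lp_t2i[idx, idx].mean()) / 2

def lm_loss(logits, targets):
    """Token-level cross-entropy for a decoder language-modeling objective."""
    lp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -lp[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 8))        # toy image embeddings (batch of 4)
txt_emb = rng.normal(size=(4, 8))        # toy text embeddings
dec_logits = rng.normal(size=(5, 100))   # toy decoder outputs, 100-token vocab
targets = rng.integers(0, 100, size=5)   # toy target token ids

# A unified framework can simply sum the two objectives (weights omitted here).
total = info_nce(img_emb, txt_emb) + lm_loss(dec_logits, targets)
print(float(total))
```

In practice the two losses would be weighted and computed from a shared encoder (contrastive) and the decoder (language modeling), but the summation pattern is the same.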