Paper Title
ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation
Paper Authors
Paper Abstract
Recent cross-lingual cross-modal works attempt to extend Vision-Language Pre-training (VLP) models to non-English inputs and achieve impressive performance. However, these models focus only on understanding tasks, utilizing encoder-only architectures. In this paper, we propose ERNIE-UniX2, a unified cross-lingual cross-modal pre-training framework for both generation and understanding tasks. ERNIE-UniX2 integrates multiple pre-training paradigms (e.g., contrastive learning and language modeling) based on an encoder-decoder architecture and attempts to learn better joint representations across languages and modalities. Furthermore, ERNIE-UniX2 can be seamlessly fine-tuned for a variety of downstream generation and understanding tasks. Pre-trained on both multilingual text-only and image-text datasets, ERNIE-UniX2 achieves SOTA results on various cross-lingual cross-modal generation and understanding tasks such as multimodal machine translation and multilingual visual question answering.
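The abstract describes combining contrastive learning with language modeling as joint pre-training objectives. A minimal sketch of how such objectives can be summed into one training loss is shown below; this is not the authors' implementation — the toy embeddings, InfoNCE formulation, temperature, and vocabulary size are all illustrative assumptions:

```python
import numpy as np

def info_nce(img, txt, temp=0.07):
    """Symmetric image-text contrastive loss (InfoNCE-style sketch)."""
    # L2-normalize embeddings, then compute a scaled similarity matrix;
    # matched image-text pairs lie on the diagonal.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temp
    idx = np.arange(len(img))
    # cross-entropy in both retrieval directions (image->text, text->image)
    lp_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    lp_t2i = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return (-lp_i2t[idx, idx].mean() - lp_t2i[idx, idx].mean()) / 2

def lm_loss(logits, targets):
    """Token-level cross-entropy for a decoder language-modeling objective."""
    lp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -lp[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 8))        # toy image embeddings (batch of 4)
txt_emb = rng.normal(size=(4, 8))        # toy text embeddings
dec_logits = rng.normal(size=(5, 100))   # toy decoder outputs, 100-token vocab
targets = rng.integers(0, 100, size=5)   # toy target token ids

# A unified framework can simply sum the two objectives (weights omitted here).
total = info_nce(img_emb, txt_emb) + lm_loss(dec_logits, targets)
print(float(total))
```

In practice the two losses would be weighted and computed from a shared encoder (contrastive) and the decoder (language modeling), but the summation pattern is the same.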