Paper Title
Multilingual Multimodal Learning with Machine Translated Text
Paper Authors
Paper Abstract
Most vision-and-language pretraining research focuses on English tasks. However, the creation of multilingual multimodal evaluation datasets (e.g. Multi30K, xGQA, XVNLI, and MaRVL) poses a new challenge in finding high-quality training data that is both multilingual and multimodal. In this paper, we investigate whether machine translating English multimodal data can be an effective proxy for the lack of readily available multilingual data. We call this framework TD-MML: Translated Data for Multilingual Multimodal Learning, and it can be applied to any multimodal dataset and model. We apply it to both pretraining and fine-tuning data with a state-of-the-art model. In order to prevent models from learning from low-quality translated text, we propose two metrics for automatically removing such translations from the resulting datasets. In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning, both at pretraining and fine-tuning.
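The abstract mentions two metrics for automatically filtering out low-quality translations, but does not specify them here. As a hedged illustration only (these are generic heuristics, not the paper's actual metrics), translation filtering of this kind can be sketched with two simple checks: a source-to-target length ratio, and the fraction of source tokens copied verbatim into the output, which flags text the MT system left untranslated:

```python
# Hypothetical quality filters for machine-translated captions.
# NOTE: these are illustrative heuristics, NOT the metrics proposed in the paper.

def length_ratio_ok(src: str, tgt: str, low: float = 0.5, high: float = 2.0) -> bool:
    """Reject translations whose character length is far from the source's."""
    ratio = max(len(tgt), 1) / max(len(src), 1)
    return low <= ratio <= high

def copy_rate(src: str, tgt: str) -> float:
    """Fraction of source tokens appearing unchanged in the translation;
    a high value suggests the output was left (partly) untranslated."""
    src_tokens = src.lower().split()
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    return sum(t in tgt_tokens for t in src_tokens) / len(src_tokens)

def keep_translation(src: str, tgt: str, max_copy: float = 0.5) -> bool:
    """Keep a (source, translation) pair only if both checks pass."""
    return length_ratio_ok(src, tgt) and copy_rate(src, tgt) <= max_copy

# An untranslated "translation" is filtered out; a plausible one is kept.
print(keep_translation("a dog runs on the beach", "a dog runs on the beach"))   # → False
print(keep_translation("a dog runs on the beach", "un chien court sur la plage"))  # → True
```

Pairs failing either check would be dropped from the translated pretraining or fine-tuning corpus before training, mirroring the filtering role the two proposed metrics play in the TD-MML pipeline.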