Paper Title

Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information

Authors

Zehui Lin, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, Lei Li

Abstract

We investigate the following question for machine translation (MT): can we develop a single universal MT model to serve as the common seed and obtain derivative and improved models on arbitrary language pairs? We propose mRASP, an approach to pre-train a universal multilingual neural machine translation model. Our key idea in mRASP is its novel technique of random aligned substitution, which brings words and phrases with similar meanings across multiple languages closer in the representation space. We pre-train an mRASP model jointly on 32 language pairs using only public datasets. The model is then fine-tuned on downstream language pairs to obtain specialized MT models. We carry out extensive experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource pairs, as well as transferring to exotic language pairs. Experimental results demonstrate that mRASP achieves significant performance improvement compared to directly training on those target pairs. This is the first time it has been verified that multiple low-resource language pairs can be utilized to improve rich-resource MT. Surprisingly, mRASP is even able to improve the translation quality on exotic languages that never occur in the pre-training corpus. Code, data, and pre-trained models are available at https://github.com/linzehui/mRASP.
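To make the core idea of random aligned substitution more concrete, below is a minimal sketch of how such a substitution step could look during pre-processing: source tokens are randomly replaced with translations drawn from a bilingual dictionary, so that aligned words in different languages appear in the same contexts. The function name, the word-level granularity, the `replace_prob` value, and the dictionary format are illustrative assumptions, not the authors' exact implementation; see the official repository linked above for the real code.

```python
import random

def random_aligned_substitution(tokens, bilingual_dict, replace_prob=0.3, rng=None):
    """Randomly swap source tokens for dictionary translations in other languages.

    This pushes words with similar meanings across languages into shared
    contexts, which is the intuition behind mRASP's pre-training signal.
    (Sketch only: values and data formats here are assumptions.)
    """
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        candidates = bilingual_dict.get(tok)
        if candidates and rng.random() < replace_prob:
            out.append(rng.choice(candidates))  # substitute an aligned translation
        else:
            out.append(tok)  # keep the original token
    return out

# Toy usage with a hypothetical English->French dictionary.
en_fr = {"love": ["aime"], "you": ["tu", "vous"], "sing": ["chanter"]}
print(" ".join(random_aligned_substitution("i love you".split(), en_fr, replace_prob=0.5)))
```

In practice, a dictionary induced from cross-lingual embeddings (e.g., MUSE-style lexicons, as the paper's public resources suggest) would supply the translation candidates, and the substituted source sentences would be fed to the standard multilingual NMT training objective.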
