Paper Title

Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?

Paper Authors

En-Shiun Annie Lee, Sarubi Thillainathan, Shravan Nayak, Surangika Ranathunga, David Ifeoluwa Adelani, Ruisi Su, Arya D. McCarthy

Paper Abstract

What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource languages? We conduct a thorough empirical experiment in 10 languages to ascertain this, considering five factors: (1) the amount of fine-tuning data, (2) the noise in the fine-tuning data, (3) the amount of pre-training data in the model, (4) the impact of domain mismatch, and (5) language typology. In addition to yielding several heuristics, the experiments form a framework for evaluating the data sensitivities of machine translation systems. While mBART is robust to domain differences, its translations for unseen and typologically distant languages remain below 3.0 BLEU. In answer to our title's question, mBART is not a low-resource panacea; we therefore encourage shifting the emphasis from new models to new data.
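The experimental core described in the abstract is fine-tuning mBART on small amounts of parallel data and then scoring translations with BLEU. Below is a minimal sketch of that kind of fine-tune-and-generate loop using the Hugging Face transformers library; the checkpoint name, language codes, hyperparameters, and toy data are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the paper's code) of fine-tuning mBART on a tiny
# parallel corpus and generating translations for BLEU evaluation.
# Checkpoint, language codes, and hyperparameters are assumptions.
import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50"  # assumed checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(
    model_name, src_lang="si_LK", tgt_lang="en_XX")  # e.g. Sinhala -> English
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Placeholder parallel data; the paper varies the size and noise of this set.
src_sentences = ["<source-language sentence>"]
tgt_sentences = ["<reference English translation>"]

# Tokenize sources and targets together; `text_target` fills in the labels.
batch = tokenizer(src_sentences, text_target=tgt_sentences,
                  return_tensors="pt", padding=True, truncation=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for _ in range(3):  # a few illustrative optimization steps
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Generate translations; BLEU would then be computed with e.g. sacrebleu.
model.eval()
inputs = tokenizer(src_sentences, return_tensors="pt", padding=True)
generated = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```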
