Paper Title

Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning

Authors

Aditya Siddhant, Ankur Bapna, Orhan Firat, Yuan Cao, Mia Xu Chen, Isaac Caswell, Xavier Garcia

Abstract

Achieving universal translation between all human language pairs is the holy-grail of machine translation (MT) research. While recent progress in massively multilingual MT is one step closer to reaching this goal, it is becoming evident that extending a multilingual MT system simply by training on more parallel data is unscalable, since the availability of labeled data for low-resource and non-English-centric language pairs is forbiddingly limited. To this end, we present a pragmatic approach towards building a multilingual MT model that covers hundreds of languages, using a mixture of supervised and self-supervised objectives, depending on the data availability for different language pairs. We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting, even surpassing supervised translation quality for low- and mid-resource languages. We conduct a wide array of experiments to understand the effect of the degree of multilingual supervision, domain mismatches and amounts of parallel and monolingual data on the quality of our self-supervised multilingual models. To demonstrate the scalability of the approach, we train models with over 200 languages and demonstrate high performance on zero-resource translation on several previously under-studied languages. We hope our findings will serve as a stepping stone towards enabling translation for the next thousand languages.
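The abstract's central recipe is to train one multilingual model with a mixture of supervised and self-supervised objectives, chosen per language pair according to how much parallel data is available. A minimal sketch of that selection logic is below; the corpus sizes, the 50/50 mixing ratio, and the `choose_objective` helper are all illustrative assumptions, not details from the paper.

```python
import random

# Hypothetical per-language-pair parallel corpus sizes (sentence pairs).
# Pairs absent from this map are treated as having no parallel data.
PARALLEL_SIZE = {
    ("en", "fr"): 40_000_000,  # high-resource
    ("en", "gd"): 50_000,      # low-resource
    # ("en", "xx") has monolingual data only
}

def choose_objective(pair, rng=random):
    """Pick a training objective for one example drawn from `pair`.

    Sketch of the mixed-objective setup described in the abstract:
    pairs with parallel data contribute supervised translation examples,
    while languages with only monolingual data contribute self-supervised
    (denoising-style) examples, enabling zero-resource translation.
    """
    n_parallel = PARALLEL_SIZE.get(pair, 0)
    if n_parallel == 0:
        # Monolingual-only language: self-supervised objective only.
        return "self_supervised"
    # With parallel data available, mix both objectives. The 50/50 split
    # here is an assumed placeholder for the paper's actual sampling.
    return "supervised" if rng.random() < 0.5 else "self_supervised"
```

In practice the "self_supervised" branch would feed masked or noised monolingual text to the same encoder-decoder, so the two objectives share all parameters and the supervised pairs transfer translation ability to the monolingual-only languages.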
