Paper Title
Data Selection Curriculum for Neural Machine Translation
Paper Authors
Abstract
Neural Machine Translation (NMT) models are typically trained on heterogeneous data that are concatenated and randomly shuffled. However, not all of the training data are equally useful to the model. Curriculum training aims to present the data to the NMT models in a meaningful order. In this work, we introduce a two-stage curriculum training framework for NMT where we fine-tune a base NMT model on subsets of data, selected by both deterministic scoring using pre-trained methods and online scoring that considers prediction scores of the emerging NMT model. Through comprehensive experiments on six language pairs comprising low- and high-resource languages from WMT'21, we have shown that our curriculum strategies consistently demonstrate better quality (up to +2.2 BLEU improvement) and faster convergence (approximately 50% fewer updates).
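The two-stage idea in the abstract can be sketched as follows. This is a minimal illustrative outline, not the authors' implementation: the function names (`select_top`, `pretrained_score`, `curriculum`), the `quality` field, and the selection fraction are all assumptions, and the actual fine-tuning step is omitted.

```python
def select_top(data, score_fn, fraction):
    """Keep the highest-scoring fraction of the training examples."""
    ranked = sorted(data, key=score_fn, reverse=True)
    return ranked[: max(1, int(len(ranked) * fraction))]

def pretrained_score(example):
    # Stage 1: deterministic score from a pre-trained method
    # (e.g. a quality score precomputed for each sentence pair).
    # The "quality" field is a placeholder for such a score.
    return example["quality"]

def curriculum(data, model_score, fraction=0.5, rounds=3):
    """Hypothetical two-stage selection loop.

    `model_score` stands in for the emerging NMT model's
    prediction score on an example.
    """
    # Stage 1: offline selection with the pre-trained scorer.
    subset = select_top(data, pretrained_score, fraction)
    for _ in range(rounds):
        # Fine-tune the base NMT model on `subset` here (omitted).
        # Stage 2: online re-selection using the emerging model's
        # prediction scores on the current subset.
        subset = select_top(subset, model_score, fraction)
    return subset
```

The key design point the abstract describes is that selection is not purely static: the online stage re-ranks data with scores from the model being trained, so the subset adapts as training progresses.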