Paper Title

Dynamic Data Selection and Weighting for Iterative Back-Translation

Authors

Zi-Yi Dou, Antonios Anastasopoulos, Graham Neubig

Abstract

Back-translation has proven to be an effective method to utilize monolingual data in neural machine translation (NMT), and iteratively conducting back-translation can further improve the model performance. Selecting which monolingual data to back-translate is crucial, as we require that the resulting synthetic data are of high quality and reflect the target domain. To achieve these two goals, data selection and weighting strategies have been proposed, with a common practice being to select samples close to the target domain but also dissimilar to the average general-domain text. In this paper, we provide insights into this commonly used approach and generalize it to a dynamic curriculum learning strategy, which is applied to iterative back-translation models. In addition, we propose weighting strategies based on both the current quality of the sentence and its improvement over the previous iteration. We evaluate our models on domain adaptation, low-resource, and high-resource MT settings and on two language pairs. Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
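The selection criterion described above (samples close to the target domain yet dissimilar to average general-domain text) is commonly instantiated as a cross-entropy difference between an in-domain and a general-domain language model, in the style of Moore and Lewis (2010). A minimal sketch of that scoring step, assuming per-sentence negative log-likelihoods have already been computed by two such language models (the function names and toy scores here are illustrative, not from the paper, which generalizes this static criterion into a dynamic curriculum):

```python
def moore_lewis_scores(in_domain_nll, general_nll):
    """Cross-entropy-difference score per sentence.

    A lower score means the sentence is more likely under the in-domain
    language model and less likely under the general-domain one, i.e.
    close to the target domain but unlike average general-domain text.
    """
    return [h_in - h_gen for h_in, h_gen in zip(in_domain_nll, general_nll)]


def select_top_k(sentences, scores, k):
    """Keep the k lowest-scoring sentences for back-translation."""
    ranked = sorted(zip(scores, sentences))
    return [sentence for _, sentence in ranked[:k]]


# Toy example with made-up per-sentence negative log-likelihoods.
sentences = ["in-domain sent", "neutral sent", "general sent"]
scores = moore_lewis_scores([1.0, 2.0, 3.0], [2.0, 2.0, 1.0])
selected = select_top_k(sentences, scores, k=2)
```

In an iterative back-translation loop, the paper replaces this one-shot ranking with a curriculum that changes across iterations and additionally weights each synthetic pair by its current quality and its improvement over the previous iteration.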
