Paper Title
Domain-Specific Text Generation for Machine Translation
Paper Authors
Paper Abstract
Preservation of domain knowledge from the source to target is crucial in any translation workflow. It is common in the translation industry to receive highly specialized projects, where there is hardly any parallel in-domain data. In such scenarios where there is insufficient in-domain data to fine-tune Machine Translation (MT) models, producing translations that are consistent with the relevant context is challenging. In this work, we propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate huge amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we use the state-of-the-art Transformer architecture. We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, in both scenarios, our proposed methods achieve improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on the Arabic-to-English and English-to-Arabic language pairs. Furthermore, the outcome of human evaluation corroborates the automatic evaluation results.
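The abstract mentions mixed fine-tuning on a blend of generic parallel data and synthetic back-translated in-domain data. The sketch below illustrates one common way such a blend is prepared: oversampling the smaller in-domain corpus so it is not drowned out by the generic corpus, then shuffling the two together. The function name and the size-ratio oversampling heuristic are illustrative assumptions, not the paper's exact recipe.

```python
import random

def mix_fine_tuning_corpora(generic_pairs, in_domain_pairs, seed=0):
    """Illustrative mixed fine-tuning data prep (assumption, not the
    paper's exact procedure): oversample the smaller in-domain corpus
    so it roughly matches the generic corpus in size, then shuffle
    both into a single training set.

    Each corpus is a list of (source, target) sentence pairs.
    """
    if not in_domain_pairs:
        return list(generic_pairs)
    # Repeat the in-domain pairs until they are comparable in size
    # to the generic corpus (at least one copy is always kept).
    factor = max(1, len(generic_pairs) // len(in_domain_pairs))
    mixed = list(generic_pairs) + list(in_domain_pairs) * factor
    random.Random(seed).shuffle(mixed)
    return mixed

# Example: 100 generic pairs, 10 synthetic in-domain pairs.
generic = [(f"src{i}", f"tgt{i}") for i in range(100)]
in_domain = [(f"dom_src{i}", f"dom_tgt{i}") for i in range(10)]
mixed = mix_fine_tuning_corpora(generic, in_domain)
print(len(mixed))  # 100 generic + 10 * 10 oversampled in-domain = 200
```

In practice the resulting mixed corpus would then be used to continue training a Transformer MT model from its generic checkpoint, which is the fine-tuning stage the abstract reports BLEU gains for.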