Paper Title
Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT
Paper Authors
Paper Abstract
Using a language model (LM) pretrained on two languages with large monolingual data in order to initialize an unsupervised neural machine translation (UNMT) system yields state-of-the-art results. When limited data is available for one language, however, this method leads to poor translations. We present an effective approach that reuses an LM that is pretrained only on the high-resource language. The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model. To reuse the pretrained LM, we have to modify its predefined vocabulary, to account for the new language. We therefore propose a novel vocabulary extension method. Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq), yielding more than +8.3 BLEU points for all four translation directions.
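The vocabulary extension step described in the abstract can be pictured with a minimal PyTorch sketch. This is an illustration under assumed settings, not the released RE-LM implementation: the embedding matrix of the English-only pretrained LM is enlarged so that BPE tokens of the new, low-resource language receive their own rows, which are then trained when the LM is fine-tuned on both languages. The function name, vocabulary sizes, and mean initialization below are assumptions made for the example.

```python
import torch
import torch.nn as nn

def extend_embeddings(old_emb: nn.Embedding, num_new_tokens: int) -> nn.Embedding:
    # Copy the pretrained rows unchanged and append rows for the new-language
    # tokens, here initialized from the mean of the existing embeddings
    # (one simple choice among several possible initializations).
    old_vocab, dim = old_emb.weight.shape
    new_emb = nn.Embedding(old_vocab + num_new_tokens, dim)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = old_emb.weight
        new_emb.weight[old_vocab:] = old_emb.weight.mean(dim=0)
    return new_emb

# Hypothetical sizes: a 32k-item English BPE vocabulary extended with 8k BPE
# items learned on the low-resource language, before fine-tuning the LM on
# both languages and using it to initialize the UNMT encoder and decoder.
pretrained = nn.Embedding(32000, 1024)
extended = extend_embeddings(pretrained, 8000)
print(extended.weight.shape)  # torch.Size([40000, 1024])
```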