Paper Title
Finding the Right Recipe for Low Resource Domain Adaptation in Neural Machine Translation
Paper Authors
Paper Abstract
General translation models often still struggle to generate accurate translations in specialized domains. To guide machine translation practitioners and to characterize the effectiveness of domain adaptation methods under different data availability scenarios, we conduct an in-depth empirical exploration of monolingual and parallel data approaches to domain adaptation of pre-trained, third-party NMT models in settings where architecture change is impractical. We compare data-centric adaptation methods in isolation and in combination. We study method effectiveness in very low resource (8k parallel examples) and moderately low resource (46k parallel examples) conditions and propose an ensemble approach to alleviate reductions in original domain translation quality. Our work covers three domains: consumer electronics, clinical, and biomedical, and spans four language pairs: Zh-En, Ja-En, Es-En, and Ru-En. We also make concrete recommendations for achieving high in-domain performance, release our consumer electronics and medical domain datasets for all languages, and make our code publicly available.
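The core parallel-data setting the abstract describes is continued training of a pre-trained, third-party NMT model on a small in-domain parallel corpus. The sketch below illustrates that setting only in outline, using an assumed Hugging Face MarianMT checkpoint (`Helsinki-NLP/opus-mt-es-en`); the model name, hyperparameters, and example sentence pairs are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch (not the paper's implementation) of continued training of a
# pre-trained, third-party NMT model on a small in-domain parallel corpus.
# The checkpoint name, hyperparameters, and example sentences are assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-es-en"       # assumed generic Es-En model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tiny stand-in for an in-domain parallel set (e.g. ~8k clinical sentence pairs).
pairs = [
    ("El paciente presenta fiebre alta.", "The patient presents with a high fever."),
    ("Agite bien el frasco antes de usar.", "Shake the bottle well before use."),
]

def collate(batch):
    src, tgt = zip(*batch)
    enc = tokenizer(list(src), text_target=list(tgt),
                    padding=True, truncation=True, return_tensors="pt")
    # Ignore padding positions when computing the target-side cross-entropy.
    enc["labels"][enc["labels"] == tokenizer.pad_token_id] = -100
    return enc

loader = DataLoader(pairs, batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):                          # few epochs: the in-domain set is small
    for batch in loader:
        loss = model(**batch).loss              # standard NMT cross-entropy loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The ensemble approach mentioned in the abstract, which is intended to limit the loss of original-domain quality, is specified in the paper body; one generic way to realize such an ensemble would be to combine the adapted and unadapted checkpoints at decode time, but that choice is an assumption here rather than the paper's stated method.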