Document-aligned Japanese-English Conversation Parallel Corpus

Paper title

Document-aligned Japanese-English Conversation Parallel Corpus

Authors

Matīss Rikters, Ryokan Ri, Tong Li, Toshiaki Nakazawa

Abstract

Sentence-level (SL) machine translation (MT) has reached acceptable quality for many high-resource languages, but document-level (DL) MT has not. DL MT is difficult to 1) train, because little DL data is available; and 2) evaluate, because the main methods and data sets focus on SL evaluation. To address the first issue, we present a document-aligned Japanese-English conversation corpus, including balanced, high-quality business conversation data for tuning and testing. As for the second issue, we manually identify the main areas where SL MT fails to produce adequate translations for lack of context. We then create an evaluation set in which these phenomena are annotated, to facilitate automatic evaluation of DL systems. We train MT models using our corpus to demonstrate how using context leads to improvements.
