PMIndia: A Collection of Parallel Corpora of Languages of India

Paper title

PMIndia -- A Collection of Parallel Corpora of Languages of India

Authors

Barry Haddow, Faheem Kirefu

Abstract

Parallel text is required for building high-quality machine translation (MT) systems, as well as for other multilingual NLP applications. For many South Asian languages, such data is in short supply. In this paper, we describe a new publicly available corpus (PMIndia) consisting of parallel sentences which pair 13 major languages of India with English. The corpus includes up to 56,000 sentences for each language pair. We explain how the corpus was constructed, including an assessment of two different automatic sentence alignment methods, and present some initial NMT results on the corpus.
