Paper title

Data Augmentation and Terminology Integration for Domain-Specific Sinhala-English-Tamil Statistical Machine Translation

Authors

Aloka Fernando, Surangika Ranathunga, Gihan Dias

Abstract

Out-of-vocabulary (OOV) words are a problem for machine translation (MT) in low-resource languages, and the problem worsens when the source and/or target languages are morphologically rich. Bilingual list integration is one approach to the OOV problem: it allows more words to be translated than appear in the training data. However, because bilingual lists contain words in their base form, they do not cover the inflected forms found in morphologically rich languages such as Sinhala and Tamil. This paper focuses on data augmentation techniques in which bilingual lexicon terms are expanded based on case markers, with the objective of generating new words for use in statistical machine translation (SMT). This data augmentation technique for dictionary terms yields improved BLEU scores for Sinhala-English SMT.
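The idea of expanding base-form lexicon entries by case marker can be illustrated with a minimal sketch. This is not the authors' implementation: the marker table, the suffix strings (`-DAT`, `-GEN`, `-INS`), the English templates, and the function names `expand_entry` and `augment_lexicon` are all hypothetical placeholders chosen for illustration, and real Sinhala or Tamil inflection would need language-specific morphological rules.

```python
from typing import Dict, List, Tuple

# Hypothetical case markers: (source-side suffix, English-side template).
# These are placeholders for illustration only, not taken from the paper.
CASE_MARKERS: Dict[str, Tuple[str, str]] = {
    "dative": ("-DAT", "to {}"),
    "genitive": ("-GEN", "of {}"),
    "instrumental": ("-INS", "by {}"),
}


def expand_entry(source_base: str, target_base: str) -> List[Tuple[str, str]]:
    """Return the base pair plus one inflected pair per case marker."""
    pairs = [(source_base, target_base)]
    for suffix, template in CASE_MARKERS.values():
        # Attach the marker on the source side and a matching
        # function word on the English side.
        pairs.append((source_base + suffix, template.format(target_base)))
    return pairs


def augment_lexicon(lexicon: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Expand every base-form entry of a bilingual lexicon."""
    augmented: List[Tuple[str, str]] = []
    for source_base, target_base in lexicon:
        augmented.extend(expand_entry(source_base, target_base))
    return augmented


if __name__ == "__main__":
    # "nounA" stands in for a base-form source-language noun.
    for pair in augment_lexicon([("nounA", "book")]):
        print(pair)
```

In a setup like the one the abstract describes, the augmented pairs would be added alongside the parallel training data so that inflected forms absent from the corpus still have a translation candidate.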
