Paper Title
Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus
Authors
Abstract
Machine translation has been a major driver of development in natural language processing. Despite the rapid advances in building more efficient machine translation systems with deep learning methods, parallel corpora remain indispensable for progress in the field. To create parallel corpora for the Kurdish language, we describe our approach to retrieving potentially alignable news articles from multilingual websites and manually aligning them across dialects and languages based on lexical similarity and transliteration of scripts. We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji. We also provide 1,797 English-Kurmanji and 650 English-Sorani translation pairs. The corpus is publicly available under the CC BY-NC-SA 4.0 license.
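As a rough illustration of how transliteration and lexical similarity could help shortlist candidates for manual alignment, the following Python sketch maps an Arabic-script sentence to Latin script and ranks Latin-script candidates by character-trigram overlap. It is not the authors' procedure: the table `TOY_TABLE`, the helper names, the trigram Jaccard measure, and the sample strings are all assumptions made for this example, and a realistic Sorani-to-Kurmanji transliteration would need many more rules.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical, deliberately tiny Arabic-script-to-Latin table (illustration only).
TOY_TABLE: Dict[str, str] = {"ب": "b", "ا": "a", "ر": "r", "ن": "n", "د": "d"}


def transliterate(text: str, table: Dict[str, str]) -> str:
    """Replace each character found in the table; leave all others unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Character n-grams of the lowercased, whitespace-normalized, padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(
    source: str, candidates: List[str], table: Dict[str, str]
) -> List[Tuple[float, str]]:
    """Score each candidate against the transliterated source, best first."""
    source_grams = char_ngrams(transliterate(source, table))
    scored = [(jaccard(source_grams, char_ngrams(c)), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Made-up sample strings; the ranking is only a hint for a human aligner.
    for score, candidate in rank_candidates("باران", ["roj baş", "baran dibare"], TOY_TABLE):
        print(f"{score:.2f}\t{candidate}")
```

A ranking like this can only suggest which pairs an annotator should inspect first; the alignment decision itself is manual, as the abstract states.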