Paper Title

PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation

Authors

Vivek Srivastava, Mayank Singh

Abstract

Code-mixing is the use of more than one language within a single sentence, and it is very common on social media platforms. The flexibility to use multiple languages in one text message may help users communicate efficiently with their target audience, but it makes processing and understanding natural language considerably more challenging. This paper presents a parallel corpus of 13,738 code-mixed English-Hindi sentences and their corresponding English translations. The translations were produced manually by annotators. We are releasing the parallel corpus to facilitate future research on code-mixed machine translation. The annotated corpus is available at https://doi.org/10.5281/zenodo.3605597.
