Data and Representation for Turkish Natural Language Inference

Paper title

Data and Representation for Turkish Natural Language Inference

Authors

Budur, Emrah; Özçelik, Rıza; Güngör, Tunga; Potts, Christopher

Abstract

Large annotated datasets in NLP are overwhelmingly in English. This is an obstacle to progress in other languages. Unfortunately, obtaining new annotated resources for each task in each language would be prohibitively expensive. At the same time, commercial machine translation systems are now robust. Can we leverage these systems to translate English-language datasets automatically? In this paper, we offer a positive response for natural language inference (NLI) in Turkish. We translated two large English NLI datasets into Turkish and had a team of experts validate their translation quality and fidelity to the original labels. Using these datasets, we address core issues of representation for Turkish NLI. We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large. Finally, we show that models trained on our machine-translated datasets are successful on human-translated evaluation sets. We share all code, models, and data publicly.
