A New Dataset for Natural Language Inference from Code-mixed Conversations

Paper title

A New Dataset for Natural Language Inference from Code-mixed Conversations

Authors

Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury

Abstract

Natural Language Inference (NLI) is the task of inferring the logical relationship, typically entailment or contradiction, between a premise and hypothesis. Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world. In this paper, we present the first dataset for code-mixed NLI, in which both the premises and hypotheses are in code-mixed Hindi-English. We use data from Hindi movies (Bollywood) as premises, and crowd-source hypotheses from Hindi-English bilinguals. We conduct a pilot annotation study and describe the final annotation protocol based on observations from the pilot. Currently, the data collected consists of 400 premises in the form of code-mixed conversation snippets and 2240 code-mixed hypotheses. We conduct an extensive analysis to infer the linguistic phenomena commonly observed in the dataset obtained. We evaluate the dataset using a standard mBERT-based pipeline for NLI and report results.
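
To make the task setup concrete, the following is a minimal sketch of how one premise-hypothesis pair might be represented and joined for a BERT-style model. It is not the authors' pipeline: the names `NLIExample`, `LABELS`, and `to_pair_input` are hypothetical, the two-label set is assumed from the abstract's wording, and the strings are placeholders rather than items from the dataset.

```python
from dataclasses import dataclass

# Assumed label set, taken from the abstract's wording
# ("typically entailment or contradiction").
LABELS = ("entailment", "contradiction")


@dataclass(frozen=True)
class NLIExample:
    """One NLI item: a premise, a hypothesis, and a label (hypothetical schema)."""
    premise: str      # code-mixed conversation snippet
    hypothesis: str   # crowd-sourced code-mixed statement
    label: str

    def __post_init__(self):
        # Reject labels outside the assumed set.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def to_pair_input(example: NLIExample) -> str:
    """Join premise and hypothesis in the BERT sentence-pair layout."""
    return f"[CLS] {example.premise} [SEP] {example.hypothesis} [SEP]"


# Placeholder strings, not real dataset items.
example = NLIExample(
    premise="<code-mixed conversation snippet>",
    hypothesis="<code-mixed hypothesis>",
    label="entailment",
)
print(to_pair_input(example))
```

In practice a tokenizer would insert the special tokens itself; the string layout here only illustrates how the two segments are paired before classification.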
