Paper Title
OCNLI: Original Chinese Natural Language Inference
Authors
Abstract
Despite the tremendous recent progress on natural language inference (NLI), driven largely by large-scale investment in new datasets (e.g., SNLI, MNLI) and advances in modeling, most progress has been limited to English due to a lack of reliable datasets for most of the world's languages. In this paper, we present the first large-scale NLI dataset (consisting of ~56,000 annotated sentence pairs) for Chinese called the Original Chinese Natural Language Inference dataset (OCNLI). Unlike recent attempts at extending NLI to other languages, our dataset does not rely on any automatic translation or non-expert annotation. Instead, we elicit annotations from native speakers specializing in linguistics. We follow closely the annotation protocol used for MNLI, but create new strategies for eliciting diverse hypotheses. We establish several baseline results on our dataset using state-of-the-art pre-trained models for Chinese, and find even the best performing models to be far outpaced by human performance (~12% absolute performance gap), making it a challenging new resource that we hope will help to accelerate progress in Chinese NLU. To the best of our knowledge, this is the first human-elicited MNLI-style corpus for a non-English language.
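To make the task concrete, the sketch below shows what one MNLI-style instance in a dataset like OCNLI looks like: a premise paired with an elicited hypothesis and one of three labels (entailment, neutral, contradiction). The field names and the example pair are illustrative assumptions, not the official OCNLI schema or actual corpus data.

```python
# Illustrative MNLI-style NLI instance. Field names and the example pair
# are assumptions for illustration only, not taken from the OCNLI release.
example = {
    "premise": "会议将于明天上午九点在北京举行。",      # "The meeting is held tomorrow at 9 a.m. in Beijing."
    "hypothesis": "会议在北京举行。",                  # "The meeting is held in Beijing."
    "label": "entailment",
}

# The three-way label set used by SNLI/MNLI-style corpora.
LABELS = ("entailment", "neutral", "contradiction")


def is_valid(ex: dict) -> bool:
    """Minimal sanity check: both sentences present, label in the 3-way set."""
    return (
        isinstance(ex.get("premise"), str)
        and isinstance(ex.get("hypothesis"), str)
        and ex.get("label") in LABELS
    )


print(is_valid(example))  # True
```

A model for this task reads the premise–hypothesis pair and predicts one of the three labels; the ~12% human–model gap reported in the abstract is measured on exactly this classification setup.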