Paper Title


A Framework for Evaluation of Machine Reading Comprehension Gold Standards

Paper Authors

Schlegel, Viktor, Valentino, Marco, Freitas, André, Nenadic, Goran, Batista-Navarro, Riza

Paper Abstract


Machine Reading Comprehension (MRC) is the task of answering a question over a paragraph of text. While neural MRC systems gain popularity and achieve noticeable performance, issues are being raised with the methodology used to establish their performance, particularly concerning the data design of gold standards that are used to evaluate them. There is but a limited understanding of the challenges present in this data, which makes it hard to draw comparisons and formulate reliable hypotheses. As a first step towards alleviating the problem, this paper proposes a unifying framework to systematically investigate the present linguistic features, required reasoning and background knowledge and factual correctness on one hand, and the presence of lexical cues as a lower bound for the requirement of understanding on the other hand. We propose a qualitative annotation schema for the first and a set of approximative metrics for the latter. In a first application of the framework, we analyse modern MRC gold standards and present our findings: the absence of features that contribute towards lexical ambiguity, the varying factual correctness of the expected answers and the presence of lexical cues, all of which potentially lower the reading comprehension complexity and quality of the evaluation data.
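To make the "lexical cues" idea concrete, the following is a minimal illustrative sketch (not the paper's actual metric) of a unigram-overlap score between a question and each passage sentence. A high overlap suggests the answer sentence can be located by surface word matching alone, i.e. a lower bound on the comprehension actually required; the function names and tokenisation are my own assumptions.

```python
# Sketch of an approximate lexical-cue metric: unigram overlap between a
# question and candidate passage sentences. Illustrative only; the paper's
# metrics are more elaborate.
import re


def tokens(text):
    """Lowercase word tokens; punctuation is stripped."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def overlap_score(question, sentence):
    """Fraction of question tokens that also occur in the sentence."""
    q = tokens(question)
    if not q:
        return 0.0
    return len(q & tokens(sentence)) / len(q)


def most_cued_sentence(question, passage_sentences):
    """Return the passage sentence with the highest lexical overlap."""
    return max(passage_sentences, key=lambda s: overlap_score(question, s))


sentences = [
    "The framework was proposed for evaluating gold standards.",
    "Machine reading comprehension answers a question over a paragraph.",
]
best = most_cued_sentence("What does machine reading comprehension do?", sentences)
```

A dataset where `most_cued_sentence` reliably contains the expected answer is one whose questions may be solvable by lexical matching rather than reading comprehension, which is the kind of quality issue the framework is designed to surface.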
