Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation

Paper title

Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation

Authors

Jannis Bulian, Christian Buck, Wojciech Gajewski, Benjamin Boerschinger, Tal Schuster

Abstract

The predictions of question answering (QA) systems are typically evaluated against manually annotated finite sets of one or more answers. This leads to a coverage limitation that results in underestimating the true performance of systems, and it is typically addressed by extending exact match (EM) with pre-defined rules or with the token-level F1 measure. In this paper, we present the first systematic conceptual and data-driven analysis to examine the shortcomings of token-level equivalence measures. To this end, we define the asymmetric notion of answer equivalence (AE), accepting answers that are equivalent to or improve over the reference, and publish over 23k human judgments for candidates produced by multiple QA systems on SQuAD. Through a careful analysis of this data, we reveal and quantify several concrete limitations of the F1 measure, such as a false impression of graduality, or missing dependence on the question. Since collecting AE annotations for each evaluated model is expensive, we learn a BERT matching (BEM) measure to approximate this task. Since this is a simpler task than QA, we find that BEM provides significantly better AE approximations than F1 and more accurately reflects the performance of systems. Finally, we demonstrate the practical utility of AE and BEM on the concrete application of minimal accurate prediction sets, reducing the number of required answers by up to 2.6x.
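For readers unfamiliar with the two token-level measures named in the abstract, the following is a minimal Python sketch of exact match and token-level F1. The normalization (lowercasing, stripping punctuation and English articles) is modeled on what SQuAD-style evaluation commonly does and is an assumption here, not the authors' code. The function names and the example answers are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumed normalization, in the style of SQuAD evaluation scripts.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(candidate: str, reference: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(candidate) == normalize(reference)


def token_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    cand, ref = normalize(candidate), normalize(reference)
    if not cand or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(cand == ref)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: both candidates share one of two tokens with the
# reference, so F1 gives them the same score even though only one of them
# could plausibly be an acceptable answer.
reference = "Barack Obama"
print(token_f1("President Obama", reference))
print(token_f1("Michelle Obama", reference))
```

Both calls score the candidate purely by token overlap with the reference and never look at the question. That is one way to see the "missing dependence on the question" the abstract refers to.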

