Paper Title

Unsupervised Evaluation for Question Answering with Transformers

Authors

Lukas Muttenthaler, Isabelle Augenstein, Johannes Bjerva

Abstract

It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalize. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer span is correct. Our method does not require any labeled data and outperforms strong heuristic baselines, across 2 datasets and 7 domains. We are able to predict whether or not a model's answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA. We expect that this method will have broad applications, e.g., in the semi-automatic development of QA datasets.
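
The abstract describes inspecting the hidden representations of question, context, and predicted answer-span tokens in a transformer QA model. Below is a minimal sketch of how such representations can be extracted with the Hugging Face transformers library. The checkpoint name and the cosine-similarity score at the end are illustrative assumptions for demonstration only, not the paper's actual evaluation method.

```python
# Sketch: extract hidden representations of the question and the predicted
# answer span from a transformer QA model. Checkpoint and the similarity
# score are illustrative assumptions, not the paper's method.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "distilbert-base-cased-distilled-squad"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(
    model_name, output_hidden_states=True
)
model.eval()

question = "Who wrote the paper?"
context = "The paper was written by Muttenthaler, Augenstein, and Bjerva."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Predicted answer span from the start/end logits (assumes start <= end).
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())

# Token-level hidden states from the last transformer layer: (seq_len, dim).
hidden = outputs.hidden_states[-1][0]

# sequence_ids: None for special tokens, 0 for question tokens, 1 for context.
seq_ids = inputs.sequence_ids(0)
q_idx = [i for i, s in enumerate(seq_ids) if s == 0]
a_idx = list(range(start, end + 1))

q_repr = hidden[q_idx].mean(dim=0)  # mean-pooled question representation
a_repr = hidden[a_idx].mean(dim=0)  # mean-pooled answer-span representation

# Illustrative unsupervised signal (an assumption): similarity between the
# answer-span and question representations.
score = torch.cosine_similarity(a_repr, q_repr, dim=0)
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
print(f"answer={answer!r}  similarity={score.item():.3f}")
```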
