Paper Title

Unsupervised Evaluation for Question Answering with Transformers

Authors

Lukas Muttenthaler, Isabelle Augenstein, Johannes Bjerva

Abstract

It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalize. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer span is correct. Our method does not require any labeled data and outperforms strong heuristic baselines, across 2 datasets and 7 domains. We are able to predict whether or not a model's answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA. We expect that this method will have broad applications, e.g., in the semi-automatic development of QA datasets.
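
The abstract describes inspecting the hidden representations of question, context, and predicted answer-span tokens in a transformer QA model. Below is a minimal sketch of how such representations can be extracted with the Hugging Face transformers library. The checkpoint name and the cosine-similarity score at the end are illustrative assumptions for demonstration only, not the paper's actual evaluation method.

```python
# Sketch: extract hidden representations of the question and the predicted
# answer span from a transformer QA model. Checkpoint and the similarity
# score are illustrative assumptions, not the paper's method.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "distilbert-base-cased-distilled-squad"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(
    model_name, output_hidden_states=True
)
model.eval()

question = "Who wrote the paper?"
context = "The paper was written by Muttenthaler, Augenstein, and Bjerva."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Predicted answer span from the start/end logits (assumes start <= end).
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())

# Token-level hidden states from the last transformer layer: (seq_len, dim).
hidden = outputs.hidden_states[-1][0]

# sequence_ids: None for special tokens, 0 for question tokens, 1 for context.
seq_ids = inputs.sequence_ids(0)
q_idx = [i for i, s in enumerate(seq_ids) if s == 0]
a_idx = list(range(start, end + 1))

q_repr = hidden[q_idx].mean(dim=0)  # mean-pooled question representation
a_repr = hidden[a_idx].mean(dim=0)  # mean-pooled answer-span representation

# Illustrative unsupervised signal (an assumption): similarity between the
# answer-span and question representations.
score = torch.cosine_similarity(a_repr, q_repr, dim=0)
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
print(f"answer={answer!r}  similarity={score.item():.3f}")
```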
