Paper Title
F1 is Not Enough! Models and Evaluation Towards User-Centered Explainable Question Answering
Paper Authors
Paper Abstract
Explainable question answering systems predict an answer together with an explanation showing why the answer has been selected. The goal is to enable users to assess the correctness of the system and understand its reasoning process. However, we show that current models and evaluation settings have shortcomings regarding the coupling of answer and explanation, which might cause serious issues in user experience. As a remedy, we propose a hierarchical model and a new regularization term to strengthen the answer-explanation coupling, as well as two evaluation scores to quantify the coupling. We conduct experiments on the HotpotQA benchmark dataset and perform a user study. The user study shows that our models increase the users' ability to judge the correctness of the system, and that scores like F1 are not enough to estimate the usefulness of a model in a practical setting with human users. Our scores are better aligned with user experience, making them promising candidates for model selection.