Paper Title

Re-Examining Calibration: The Case of Question Answering

Authors

Chenglei Si, Chen Zhao, Sewon Min, Jordan Boyd-Graber

Abstract

For users to trust model predictions, they need to understand model outputs, particularly their confidence: calibration aims to adjust (calibrate) models' confidence to match expected accuracy. We argue that the traditional calibration evaluation does not promote effective calibrations: for example, it can encourage always assigning a mediocre confidence score to all predictions, which does not help users distinguish correct predictions from wrong ones. Building on those observations, we propose a new calibration metric, MacroCE, that better captures whether the model assigns low confidence to wrong predictions and high confidence to correct predictions. Focusing on the practical application of open-domain question answering, we examine conventional calibration methods applied on the widely-used retriever-reader pipeline, all of which do not bring significant gains under our new MacroCE metric. Toward better calibration, we propose a new calibration method (ConsCal) that uses not just final model predictions but whether multiple model checkpoints make consistent predictions. Altogether, we provide an alternative view of calibration along with a new metric, re-evaluation of existing calibration methods on our metric, and proposal of a more effective calibration method.
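The abstract only describes MacroCE and ConsCal at a high level. As a rough, non-authoritative illustration of those two ideas (not the paper's exact formulation), the sketch below computes a macro-averaged calibration error that separately penalizes overconfident wrong predictions and underconfident correct ones, and derives a confidence score from how many model checkpoints agree with the final prediction. All function and variable names here are hypothetical.

```python
import numpy as np

def macro_calibration_error(confidences, correct):
    """Illustrative macro-style calibration error (hypothetical formulation).

    Penalizes low confidence on correct predictions and high confidence on
    wrong predictions, then macro-averages the two groups so neither side
    dominates when correct and wrong predictions are imbalanced.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    # Error on correct predictions: how far confidence falls below 1.
    pos_err = float(np.mean(1.0 - confidences[correct])) if correct.any() else 0.0
    # Error on wrong predictions: how far confidence rises above 0.
    neg_err = float(np.mean(confidences[~correct])) if (~correct).any() else 0.0
    return 0.5 * (pos_err + neg_err)

def consistency_confidence(checkpoint_predictions):
    """Illustrative consistency-based confidence: the fraction of model
    checkpoints whose prediction agrees with the final checkpoint's answer."""
    final = checkpoint_predictions[-1]
    agree = sum(pred == final for pred in checkpoint_predictions)
    return agree / len(checkpoint_predictions)

# Toy usage: three QA predictions with confidences and correctness labels.
confs = [0.9, 0.55, 0.8]
labels = [True, True, False]
print(macro_calibration_error(confs, labels))

# Confidence derived from agreement across four hypothetical checkpoints.
print(consistency_confidence(["Paris", "Paris", "Lyon", "Paris"]))
```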
