Paper Title
Useful Confidence Measures: Beyond the Max Score
Paper Authors
Paper Abstract
An important component in deploying machine learning (ML) in safety-critical applications is having a reliable measure of confidence in the ML model's predictions. For a classifier $f$ producing a probability vector $f(x)$ over the candidate classes, the confidence is typically taken to be $\max_i f(x)_i$. This approach is potentially limited, as it disregards the rest of the probability vector. In this work, we derive several confidence measures that depend on information beyond the maximum score, such as margin-based and entropy-based measures, and empirically evaluate their usefulness, focusing on NLP tasks with distribution shifts and Transformer-based models. We show that when models are evaluated on out-of-distribution data ``out of the box'', using only the maximum score to inform the confidence measure is highly suboptimal. In the post-processing regime (where the scores of $f$ can be improved using additional in-distribution held-out data), this remains true, albeit to a lesser degree. Overall, our results suggest that entropy-based confidence is a surprisingly useful measure.
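To make the contrast between the measures concrete, here is a minimal sketch of the standard definitions of max-score, margin-based, and entropy-based confidence computed from a probability vector. This is an illustration under common textbook definitions, not necessarily the exact formulations derived in the paper; the function names and the NumPy representation are assumptions for the example.

```python
import numpy as np

def max_score(p: np.ndarray) -> float:
    """Standard confidence: the highest class probability, max_i f(x)_i."""
    return float(np.max(p))

def margin(p: np.ndarray) -> float:
    """Margin-based confidence: gap between the top two class probabilities."""
    top_two = np.sort(p)[-2:]          # two largest probabilities, ascending
    return float(top_two[1] - top_two[0])

def negative_entropy(p: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based confidence: negative Shannon entropy of the vector.
    A peaked (low-entropy) distribution yields a higher confidence value."""
    return float(np.sum(p * np.log(p + eps)))

# Two predictions with the same max score but different overall uncertainty:
p1 = np.array([0.5, 0.5, 0.0])    # remaining mass concentrated on one rival class
p2 = np.array([0.5, 0.25, 0.25])  # remaining mass spread out

print(max_score(p1), max_score(p2))              # 0.5 vs 0.5 (indistinguishable)
print(margin(p1), margin(p2))                    # 0.0 vs 0.25
print(negative_entropy(p1), negative_entropy(p2))
```

The example highlights the abstract's point: two predictions with identical max scores can differ in the rest of the probability vector, and measures such as margin and entropy are sensitive to that difference while the max score is not.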