Paper Title
Mitigating Bias in Calibration Error Estimation
Paper Authors
Paper Abstract
For an AI system to be reliable, the confidence it expresses in its decisions must match its accuracy. To assess the degree of match, examples are typically binned by confidence and the per-bin mean confidence and accuracy are compared. Most research in calibration focuses on techniques to reduce this empirical measure of calibration error, ECE_bin. We instead focus on assessing statistical bias in this empirical measure, and we identify better estimators. We propose a framework through which we can compute the bias of a particular estimator for an evaluation data set of a given size. The framework involves synthesizing model outputs that have the same statistics as common neural architectures on popular data sets. We find that binning-based estimators with bins of equal mass (number of instances) have lower bias than estimators with bins of equal width. Our results indicate two reliable calibration-error estimators: the debiased estimator (Brocker, 2012; Ferro and Fricker, 2012) and a method we propose, ECE_sweep, which uses equal-mass bins and chooses the number of bins to be as large as possible while preserving monotonicity in the calibration function. With these estimators, we observe improvements in the effectiveness of recalibration methods and in the detection of model miscalibration.
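The binned estimator described above can be sketched in a few lines. This is an illustrative implementation, not the paper's reference code: the function name `ece_bin`, the default of 15 bins, and the quantile-based equal-mass binning are assumptions for the sketch.

```python
import numpy as np

def ece_bin(confidences, correct, num_bins=15, equal_mass=True):
    """Binned expected calibration error (ECE_bin), sketched.

    Examples are grouped by confidence; within each bin, the absolute gap
    between mean confidence and mean accuracy is weighted by bin mass.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    if equal_mass:
        # Equal-mass bins: edges at quantiles, so each bin holds
        # roughly n / num_bins examples.
        edges = np.quantile(confidences, np.linspace(0.0, 1.0, num_bins + 1))
    else:
        # Equal-width bins over [0, 1].
        edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Assign each example to a bin (clip so the top edge is inclusive).
    idx = np.clip(np.searchsorted(edges, confidences, side="right") - 1,
                  0, num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = idx == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece
```

For example, a model that predicts with confidence 0.9 but is right only half the time has a gap of 0.4 under either binning scheme; the abstract's point is that the *bias* of such estimators differs between equal-width and equal-mass bins, which this sketch does not attempt to correct.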