Paper Title
Improving Evaluation of Debiasing in Image Classification
Paper Authors
Paper Abstract
Image classifiers often overly rely on peripheral attributes that have a strong correlation with the target class (i.e., dataset bias) when making predictions. Due to dataset bias, a model correctly classifies data samples that include the bias attributes (i.e., bias-aligned samples) while failing to correctly predict those without the bias attributes (i.e., bias-conflicting samples). Recently, a myriad of studies have focused on mitigating such dataset bias, a task referred to as debiasing. However, our comprehensive study indicates that several issues need to be addressed when evaluating debiasing in image classification. First, most previous studies do not specify how they select their hyper-parameters and model checkpoints (i.e., the tuning criterion). Second, debiasing studies to date have evaluated their proposed methods only on datasets with excessively high bias severity, and these methods show degraded performance on datasets with low bias severity. Third, debiasing studies do not share consistent experimental settings (e.g., datasets and neural networks), which need to be standardized for fair comparison. Given these issues, this paper 1) proposes an evaluation metric, the 'Align-Conflict (AC) score', for the tuning criterion, 2) includes experimental settings with low bias severity and shows that they are yet to be explored, and 3) unifies standardized experimental settings to promote fair comparisons between debiasing methods. We believe that our findings and lessons will inspire future researchers in debiasing to further push state-of-the-art performance with fair comparisons.
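To make the tuning-criterion idea concrete, below is a minimal Python sketch of an AC-style metric that balances accuracy on bias-aligned and bias-conflicting validation samples when selecting hyper-parameters or checkpoints. The abstract does not give the exact formula, so the harmonic-mean combination and all names here (`subset_accuracy`, `align_conflict_score`) are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def subset_accuracy(preds: np.ndarray, labels: np.ndarray, mask: np.ndarray) -> float:
    """Accuracy restricted to the samples selected by the boolean `mask`."""
    return float((preds[mask] == labels[mask]).mean()) if mask.any() else 0.0

def align_conflict_score(preds: np.ndarray, labels: np.ndarray,
                         is_bias_aligned: np.ndarray) -> float:
    """Hypothetical AC-style tuning criterion.

    Combines accuracy on bias-aligned and bias-conflicting validation
    samples; the harmonic mean (an assumed choice) penalizes checkpoints
    that score well on one group by exploiting the bias while failing
    on the other.
    """
    acc_align = subset_accuracy(preds, labels, is_bias_aligned)
    acc_conflict = subset_accuracy(preds, labels, ~is_bias_aligned)
    if acc_align + acc_conflict == 0:
        return 0.0
    return 2 * acc_align * acc_conflict / (acc_align + acc_conflict)

# Hypothetical checkpoint selection: keep the checkpoint whose validation
# score is highest, instead of tuning on overall (bias-dominated) accuracy.
# scores = {step: align_conflict_score(p, y, aligned_mask)
#           for step, (p, y) in checkpoint_predictions.items()}
# best_step = max(scores, key=scores.get)
```

One design point this sketch illustrates: because bias-aligned samples typically dominate the validation set, plain accuracy rewards models that exploit the bias; any criterion that reports the two groups separately and combines them symmetrically avoids that failure mode.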