Paper title
Failure Detection in Medical Image Classification: A Reality Check and Benchmarking Testbed
Paper authors
Paper abstract
Failure detection in automated image classification is a critical safeguard for clinical deployment. Detected failure cases can be referred to human assessment, ensuring patient safety in computer-aided clinical decision making. Despite its paramount importance, there is insufficient evidence about the ability of state-of-the-art confidence scoring methods to detect test-time failures of classification models in the context of medical imaging. This paper provides a reality check, establishing the performance of in-domain misclassification detection methods, benchmarking 9 widely used confidence scores on 6 medical imaging datasets with different imaging modalities, in multiclass and binary classification settings. Our experiments show that the problem of failure detection is far from being solved. We found that none of the benchmarked advanced methods proposed in the computer vision and machine learning literature can consistently outperform a simple softmax baseline, demonstrating that improved out-of-distribution detection or model calibration does not necessarily translate to improved in-domain misclassification detection. Our developed testbed facilitates future work in this important area.
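To make the softmax baseline mentioned in the abstract concrete, the following is a minimal sketch, not the paper's testbed code. It assumes the baseline confidence score is the maximum softmax probability and that failure detection is scored by AUROC (correct predictions should receive higher confidence than misclassified ones). The abstract does not name its evaluation metric, so AUROC is an assumption here, and the function names and toy logits are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def max_softmax_confidence(logits):
    """Baseline confidence score (assumed): the largest softmax probability."""
    return softmax(logits).max(axis=1)


def failure_detection_auroc(confidence, correct):
    """AUROC for separating correct from misclassified predictions.

    Equals the probability that a randomly chosen correct prediction has
    higher confidence than a randomly chosen misclassified one (ties count
    as one half). This metric choice is an assumption, not taken from the
    abstract.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos = confidence[correct]    # confidences of correct predictions
    neg = confidence[~correct]   # confidences of misclassified predictions
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one correct and one incorrect prediction")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)


# Hypothetical toy data: 4 test samples, 3 classes.
logits = np.array([
    [4.0, 0.0, 0.0],
    [0.2, 0.1, 0.0],
    [0.0, 3.0, 0.0],
    [1.0, 0.9, 0.0],
])
labels = np.array([0, 1, 1, 0])

predictions = logits.argmax(axis=1)
correct = predictions == labels
confidence = max_softmax_confidence(logits)
auroc = failure_detection_auroc(confidence, correct)
print(f"failure detection AUROC: {auroc:.2f}")
```

In this toy case the single misclassified sample has the lowest confidence, so the score separates failures perfectly. The abstract's finding is that on real medical imaging data no benchmarked method achieves this reliably, and none consistently beats this kind of baseline.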