Paper title
ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations
Paper authors
Paper abstract
Deep learning vision systems are widely deployed across applications where reliability is critical. However, even today's best models can fail to recognize an object when its pose, lighting, or background varies. While existing benchmarks surface examples that are challenging for models, they do not explain why such mistakes arise. To address this need, we introduce ImageNet-X, a set of sixteen human annotations of factors such as pose, background, or lighting for the entire ImageNet-1k validation set as well as a random subset of 12k training images. Equipped with ImageNet-X, we investigate 2,200 current recognition models and study the types of mistakes as a function of a model's (1) architecture, e.g., transformer vs. convolutional, (2) learning paradigm, e.g., supervised vs. self-supervised, and (3) training procedures, e.g., data augmentation. Regardless of these choices, we find that models have consistent failure modes across ImageNet-X categories. We also find that while data augmentation can improve robustness to certain factors, it induces spill-over effects on other factors. For example, strong random cropping hurts robustness on smaller objects. Together, these insights suggest that, to advance the robustness of modern vision models, future research should focus on collecting additional data and understanding data augmentation schemes. Along with these insights, we release a toolkit based on ImageNet-X to spur further study into the mistakes image recognition systems make.