Paper Title
Data-Centric Debugging: Mitigating Model Failures via Targeted Data Collection
Paper Authors
Paper Abstract
Deep neural networks can be unreliable in the real world when the training set does not adequately cover all the settings where they are deployed. Focusing on image classification, we consider the setting where we have an error distribution $\mathcal{E}$ representing a deployment scenario where the model fails. We have access to a small set of samples $\mathcal{E}_{sample}$ from $\mathcal{E}$, and it can be expensive to obtain additional samples. In the traditional model development framework, mitigating failures of the model on $\mathcal{E}$ can be challenging and is often done in an ad hoc manner. In this paper, we propose a general methodology for model debugging that can systematically improve model performance on $\mathcal{E}$ while maintaining its performance on the original test set. Our key assumption is that we have access to a large pool of weakly (noisily) labeled data $\mathcal{F}$. However, naively adding $\mathcal{F}$ to the training set would hurt model performance due to the large extent of label noise. Our Data-Centric Debugging (DCD) framework carefully creates a debug-train set by selecting images from $\mathcal{F}$ that are perceptually similar to the images in $\mathcal{E}_{sample}$. To do this, we use the $\ell_2$ distance in the feature space (penultimate-layer activations) of various models including ResNet, Robust ResNet, and DINO, where we observe that DINO ViTs are significantly better at discovering similar images than ResNets. Compared to LPIPS, we find that our method reduces compute and storage requirements by 99.58\%. Compared to the baselines that maintain model performance on the test set, we achieve significantly (+9.45\%) improved results on the debug-heldout sets.
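The selection step described above can be sketched as a nearest-neighbor search in feature space: embed both the weakly labeled pool $\mathcal{F}$ and the error samples $\mathcal{E}_{sample}$ with a fixed feature extractor, then pick the pool images closest (in $\ell_2$ distance on penultimate-layer activations) to each error sample. The sketch below is a minimal NumPy illustration of that idea, not the paper's implementation; the function name, the choice of `k`, and the assumption that features are already precomputed as row vectors are all ours.

```python
import numpy as np

def select_debug_train(pool_feats: np.ndarray, error_feats: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices into the weakly labeled pool of the k nearest
    neighbors (squared l2 distance in feature space) of each error sample.

    pool_feats:  (n_pool, d) penultimate-layer features of pool images.
    error_feats: (n_error, d) features of the error-distribution samples.
    """
    # Pairwise squared l2 distances via the expansion
    # ||f - e||^2 = ||f||^2 + ||e||^2 - 2 f.e  (shape: n_pool x n_error).
    d2 = (
        (pool_feats ** 2).sum(axis=1)[:, None]
        + (error_feats ** 2).sum(axis=1)[None, :]
        - 2.0 * pool_feats @ error_feats.T
    )
    # For each error sample, take the k closest pool images,
    # then deduplicate across error samples to form the debug-train set.
    nearest = np.argsort(d2, axis=0)[:k]  # shape (k, n_error)
    return np.unique(nearest)
```

In practice the features would come from a frozen backbone (e.g. a DINO ViT, which the abstract reports works best), and selection per error sample keeps the debug-train set focused on the failure mode while filtering out most of the label noise in $\mathcal{F}$.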