Paper Title
Scalable Penalized Regression for Noise Detection in Learning with Noisy Labels
Paper Authors
Paper Abstract
Noisy training sets usually lead to degraded generalization and robustness of neural networks. In this paper, we propose a theoretically guaranteed noisy label detection framework to detect and remove noisy data for Learning with Noisy Labels (LNL). Specifically, we design a penalized regression to model the linear relation between network features and one-hot labels, where the noisy data are identified by the non-zero mean-shift parameters solved in the regression model. To make the framework scalable to datasets with a large number of categories and training samples, we propose a split algorithm that divides the whole training set into small pieces, which can be solved by the penalized regression in parallel, leading to the Scalable Penalized Regression (SPR) framework. We provide a non-asymptotic probabilistic condition under which SPR correctly identifies the noisy data. While SPR can be regarded as a sample selection module for a standard supervised training pipeline, we also combine it with a semi-supervised algorithm to further exploit the support of the noisy data as unlabeled data. Experimental results on several benchmark datasets and real-world noisy datasets show the effectiveness of our framework. Our code and pretrained models are released at https://github.com/Yikai-Wang/SPR-LNL.
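The mean-shift idea in the abstract can be sketched in a minimal form. The following toy implementation is an illustrative assumption, not the authors' released solver: it models Y ≈ Xβ + Γ, alternating a ridge solve for β with row-wise group soft-thresholding of the residual to update Γ, and flags rows whose Γ stays non-zero as noisy. The function names (`soft_threshold_rows`, `detect_noisy`), the synthetic data, and all hyperparameters are chosen here for illustration only.

```python
import numpy as np

def soft_threshold_rows(R, tau):
    """Row-wise group soft-thresholding: rows with norm <= tau shrink to zero."""
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12)) * R

def detect_noisy(X, Y, lam=0.01, tau=1.0, n_iter=100):
    """Toy mean-shift penalized regression (illustrative sketch, not SPR itself).

    Alternates two convex sub-steps:
      beta  <- argmin ||Y - X beta - Gamma||^2 + lam ||beta||^2  (ridge solve)
      Gamma <- row-wise soft-threshold of the residual Y - X beta
    Rows whose Gamma remains non-zero are declared noisy.
    """
    n, d = X.shape
    Gamma = np.zeros_like(Y)
    A = X.T @ X + lam * np.eye(d)  # ridge system matrix, reused every iteration
    for _ in range(n_iter):
        beta = np.linalg.solve(A, X.T @ (Y - Gamma))
        Gamma = soft_threshold_rows(Y - X @ beta, tau)
    return np.linalg.norm(Gamma, axis=1) > 1e-6

# Synthetic check: exactly linear targets, with a large mean shift
# added to the first 10 rows to simulate label noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
B = rng.normal(size=(10, 3))
Y = X @ B
Y[:10] += 5.0  # corrupt 10 samples
noisy = detect_noisy(X, Y)
```

In the setting of the abstract, `X` would hold network features and `Y` one-hot labels; the split algorithm would then amount to partitioning the rows into chunks and running the regression on each chunk in parallel.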