Paper title
Identifying noisy labels with a transductive semi-supervised leave-one-out filter
Paper authors
Paper abstract
Obtaining data with meaningful labels is often costly and error-prone. In this situation, semi-supervised learning (SSL) approaches are attractive, as they leverage assumptions about the unlabeled data to make up for the limited number of labels. However, in real-world situations, we cannot assume that the labeling process is infallible, and the accuracy of many SSL classifiers decreases significantly in the presence of label noise. In this work, we introduce LGC_LVOF, a leave-one-out filtering approach based on the Local and Global Consistency (LGC) algorithm. Our method aims to detect and remove wrong labels, and thus can be used as a preprocessing step for any SSL classifier. Given the propagation matrix, detecting noisy labels takes $O(cl)$ per step, with $c$ the number of classes and $l$ the number of labels. Moreover, one does not need to compute the whole propagation matrix, but only an $l$ by $l$ submatrix corresponding to interactions between labeled instances. As a result, our approach is best suited to datasets with a large amount of unlabeled data but not many labels. Results are provided for a number of datasets, including MNIST and ISOLET. LGC_LVOF appears to be as precise as, or more precise than, the adapted gradient-based filter. We show that embedding LGC_LVOF into LGC yields a best-case accuracy comparable to the best case of $\ell_1$-based classifiers designed to be robust to label noise. We provide a heuristic for choosing the number of instances to remove.
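The leave-one-out idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard LGC propagation matrix $P = (I - \alpha S)^{-1}$ with $S = D^{-1/2} W D^{-1/2}$ and a dense RBF affinity $W$, and the margin-based scoring rule is a hypothetical choice made only for illustration. The function names, `alpha`, `sigma`, and the toy data are likewise assumptions. The sketch shows the two properties the abstract states: only the $l$ by $l$ labeled block of $P$ is used, and each removal step costs $O(cl)$ once that block is available.

```python
import numpy as np


def lgc_labeled_block(X, labeled_idx, alpha=0.9, sigma=1.0):
    """Return the l-by-l block of P = (I - alpha*S)^{-1} for labeled points.

    Assumes a dense RBF affinity; suitable for small demos only.
    """
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))  # symmetric normalization D^-1/2 W D^-1/2
    n = X.shape[0]
    # Solve only for the columns of P that belong to labeled points.
    E = np.eye(n)[:, labeled_idx]
    Z = np.linalg.solve(np.eye(n) - alpha * S, E)
    return Z[labeled_idx, :]


def leave_one_out_filter(P_ll, Y, n_remove):
    """Greedily flag n_remove suspicious labels (hypothetical margin rule).

    P_ll: (l, l) labeled block of the propagation matrix.
    Y:    (l, c) one-hot matrix of the given (possibly noisy) labels, c >= 2.
    """
    Y = np.array(Y, dtype=float)
    F = P_ll @ Y                      # scores with every label active
    self_weight = np.diag(P_ll)
    rows = np.arange(len(Y))
    active = np.ones(len(Y), dtype=bool)
    removed = []
    for _ in range(n_remove):
        # Leave-one-out scores: subtract each point's own contribution, O(cl).
        F_loo = F - self_weight[:, None] * Y
        given = Y.argmax(axis=1)
        own = F_loo[rows, given]
        others = F_loo.copy()
        others[rows, given] = -np.inf
        margin = own - others.max(axis=1)
        margin[~active] = np.inf      # never pick an already removed point
        worst = int(margin.argmin())
        removed.append(worst)
        # Rank-one update after dropping the label of `worst`, O(cl).
        F -= np.outer(P_ll[:, worst], Y[worst])
        Y[worst] = 0.0
        active[worst] = False
    return removed


# Toy data: two tight, well-separated clusters of 10 points each.
cluster_a = np.array([[0.2 * i, 0.2 * j] for i in range(5) for j in range(2)])
cluster_b = cluster_a + 5.0
X = np.vstack([cluster_a, cluster_b])

labeled_idx = [0, 3, 6, 9, 10, 13, 16, 19]
given_labels = np.array([1, 0, 0, 0, 1, 1, 1, 1])  # first label is flipped
Y = np.eye(2)[given_labels]

P_ll = lgc_labeled_block(X, labeled_idx)
removed = leave_one_out_filter(P_ll, Y, n_remove=1)
print(removed)
```

On this toy data the flipped label sits at position 0 of the labeled set, so the filter is expected to flag it first. In practice the labels that remain would then be passed to an SSL classifier of choice, and the number of removals would be set by a heuristic such as the one the abstract mentions.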