Analysis of label noise in graph-based semi-supervised learning

Paper Title

Analysis of label noise in graph-based semi-supervised learning

Authors

Afonso, Bruno Klaus de Aquino; Berton, Lilian

Abstract

In machine learning, labels are needed to supervise a model that can generalize to unseen data. However, the labeling process can be tedious, slow, costly, and error-prone, and most available data is often unlabeled. Semi-supervised learning (SSL) alleviates this by making strong assumptions about the relation between the labels and the input data distribution. This paradigm has been successful in practice, but most SSL algorithms end up fully trusting the few available labels. In real life, both humans and automated systems are prone to mistakes, so it is essential that our algorithms can work with labels that are both few and unreliable. Our work aims to perform an extensive empirical evaluation of existing graph-based semi-supervised algorithms, such as Gaussian Fields and Harmonic Functions, Local and Global Consistency, Laplacian Eigenmaps, and Graph Transduction Through Alternating Minimization. To do so, we compare the accuracy of classifiers while varying the amount of labeled data and label noise over many different samples. Our results show that, if the dataset is consistent with SSL assumptions, we are able to detect the noisiest instances, although this gets harder as the number of available labels decreases. Also, the Laplacian Eigenmaps algorithm performed better than label propagation when the data came from high-dimensional clusters.
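
As a rough illustration of the setting the abstract describes, the sketch below runs the standard Local and Global Consistency iteration, F ← αSF + (1 − α)Y with S = D^(-1/2) W D^(-1/2), on a toy graph where one of the few labels has been flipped. It is not the authors' experimental code. The 1-D two-cluster data, the values of `sigma`, `alpha`, and `n_iter`, and the rule of flagging a labeled point whose propagated prediction disagrees with its given label are all assumptions made for this example.

```python
import math

def rbf_affinity(points, sigma):
    """Dense Gaussian affinity matrix with a zero diagonal."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dist_sq = (points[i] - points[j]) ** 2
                W[i][j] = math.exp(-dist_sq / (2 * sigma ** 2))
    return W

def lgc_predict(W, labels, n_classes, alpha=0.99, n_iter=1500):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y and return argmax labels.

    `labels` holds a class index for labeled nodes and None otherwise.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    # Symmetrically normalized affinity S = D^(-1/2) W D^(-1/2).
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # One-hot label matrix; unlabeled rows stay all-zero.
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
         for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(n_iter):
        F = [[alpha * sum(S[i][j] * F[j][c] for j in range(n))
              + (1 - alpha) * Y[i][c]
              for c in range(n_classes)]
             for i in range(n)]
    return [row.index(max(row)) for row in F]

# Toy data (assumed): two well-separated 1-D clusters of 10 points each.
points = [0.1 * i for i in range(10)] + [3.0 + 0.1 * i for i in range(10)]
true_labels = [0] * 10 + [1] * 10

# Four labeled points per cluster; the label of point 3 is deliberately wrong.
labels = [None] * 20
for i in (0, 3, 6, 9):
    labels[i] = 0
for i in (10, 13, 16, 19):
    labels[i] = 1
labels[3] = 1  # injected label noise

pred = lgc_predict(rbf_affinity(points, sigma=0.3), labels, n_classes=2)

# Heuristic (assumed): a labeled point is suspicious when the propagated
# prediction disagrees with the label it was given.
suspects = [i for i, y in enumerate(labels) if y is not None and pred[i] != y]
accuracy = sum(p == t for p, t in zip(pred, true_labels)) / len(true_labels)
print("accuracy:", accuracy, "suspected noisy labels:", suspects)
```

Because the toy clusters agree with the cluster assumption and the clean labels outnumber the noisy one within the affected cluster, the propagated scores override the flipped label. With fewer clean labels per cluster that margin shrinks, which is consistent with the abstract's remark that detection gets harder as labels become scarce.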
