Paper Title
The Role of Pseudo-labels in Self-training Linear Classifiers on High-dimensional Gaussian Mixture Data
Paper Authors
Paper Abstract
Self-training (ST) is a simple yet effective semi-supervised learning method. However, why and how ST improves generalization performance by using potentially erroneous pseudo-labels is still not well understood. To deepen the understanding of ST, we derive and analyze a sharp characterization of the behavior of iterative ST when training a linear classifier by minimizing a ridge-regularized convex loss on binary Gaussian mixtures, in the asymptotic limit where the input dimension and the data size diverge proportionally. The results show that ST improves generalization in different ways depending on the number of iterations. When the number of iterations is small, ST improves generalization by fitting the model to relatively reliable pseudo-labels and updating the model parameters by a large amount at each iteration; in this regime, ST behaves as intuition suggests. With many iterations, by contrast, ST can gradually improve the direction of the classification plane by updating the model parameters incrementally, using soft labels and small regularization. We argue that this is because the small updates of ST extract information from the data in an almost noiseless way. However, in the presence of label imbalance, ST underperforms supervised learning with true labels. To overcome this, we propose two heuristics that enable ST to achieve performance nearly comparable to that of supervised learning even under significant label imbalance.
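As a concrete illustration of the iterative ST procedure the abstract describes, below is a minimal sketch on synthetic binary Gaussian mixture data. It assumes scikit-learn's L2-penalized logistic regression as one instance of a ridge-regularized convex loss; all dimensions, sample sizes, regularization strengths, iteration counts, and helper names (sample, n_lab, n_unl) are illustrative choices, not the paper's settings, and the soft-label loop only loosely mimics the incremental updates analyzed in the paper.

```python
# Minimal self-training (ST) sketch on a binary Gaussian mixture.
# Assumptions: logistic loss stands in for "a ridge-regularized convex
# loss"; all constants below are illustrative, not the paper's values.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n_lab, n_unl = 50, 100, 2000
mu = rng.normal(size=d) / np.sqrt(d)  # cluster mean of the mixture

def sample(n):
    """Draw n points from a balanced two-component Gaussian mixture."""
    y = rng.choice([-1, 1], size=n)
    X = y[:, None] * mu + rng.normal(size=(n, d))
    return X, y

X_lab, y_lab = sample(n_lab)
X_unl, y_unl = sample(n_unl)  # y_unl is used only to measure accuracy

# Initial supervised fit on the small labeled set.
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X_lab, y_lab)

# Few-iteration regime: fit to hard pseudo-labels, large update per step.
# Refitting on labeled plus pseudo-labeled data is one common variant.
for t in range(3):
    pseudo = clf.predict(X_unl)
    clf = LogisticRegression(C=1.0, max_iter=1000).fit(
        np.vstack([X_lab, X_unl]), np.concatenate([y_lab, pseudo])
    )
    acc = np.mean(clf.predict(X_unl) == y_unl)
    print(f"hard-label iter {t}: accuracy = {acc:.3f}")

# Many-iteration regime: soft labels and small regularization (a large C
# in scikit-learn means a weak ridge penalty). Soft labels are emulated by
# duplicating each unlabeled point with labels +1/-1, weighted by the
# model's predicted probabilities. Note: each refit here replaces the
# model outright, so this only loosely illustrates the paper's
# incremental parameter updates.
for t in range(20):
    p = clf.predict_proba(X_unl)[:, list(clf.classes_).index(1)]
    X_soft = np.vstack([X_unl, X_unl])
    y_soft = np.concatenate([np.ones(n_unl), -np.ones(n_unl)])
    w_soft = np.concatenate([p, 1.0 - p])
    clf = LogisticRegression(C=100.0, max_iter=1000).fit(
        X_soft, y_soft, sample_weight=w_soft
    )
acc = np.mean(clf.predict(X_unl) == y_unl)
print(f"soft-label final accuracy = {acc:.3f}")
```

The duplicated-row trick with sample_weight is simply a way to express fractional (soft) targets with a standard classifier interface; it mirrors the abstract's contrast between fitting hard pseudo-labels in the few-iteration regime and using soft labels with small regularization over many iterations.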