Paper Title

Improving Positive Unlabeled Learning: Practical AUL Estimation and New Training Method for Extremely Imbalanced Data Sets

Paper Authors

Liwei Jiang, Dan Li, Qisheng Wang, Shuai Wang, Songtao Wang

Paper Abstract

Positive Unlabeled (PU) learning is widely used in many applications, where a binary classifier is trained on datasets consisting of only positive and unlabeled samples. In this paper, we improve PU learning over the state of the art in two aspects. Firstly, existing model evaluation methods for PU learning require the ground truth of unlabeled samples, which is unlikely to be available in practice. To relax this restriction, we propose an asymptotically unbiased, practical AUL (area under the lift) estimation method, which makes use of raw PU data without prior knowledge of the unlabeled samples. Secondly, we propose ProbTagging, a new training method for extremely imbalanced data sets, where the number of unlabeled samples is hundreds or thousands of times that of positive samples. ProbTagging introduces probability into the aggregation method. Specifically, each unlabeled sample is tagged positive or negative with a probability calculated based on its similarity to its positive neighbors. Based on this, multiple data sets are generated to train different models, which are then combined into an ensemble model. Compared to state-of-the-art work, experimental results show that ProbTagging can increase the AUC by up to 10% on three industrial and two artificial PU data sets.
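The abstract does not spell out the paper's asymptotically unbiased AUL estimator, but the quantity it improves on is easy to state: the empirical area under the lift (cumulative-gains) curve computed on raw PU data, treating the labeled positives as hits and everything unlabeled as misses. A minimal sketch of that plain empirical baseline (the function name and this exact form are our assumptions, not the paper's estimator):

```python
import numpy as np

def empirical_aul(scores, pu_labels):
    """Plain empirical area under the lift curve on raw PU data.

    scores    : classifier scores, higher = more likely positive
    pu_labels : 1 for labeled positives, 0 for unlabeled samples
                (unlabeled samples are simply treated as non-hits here,
                which is what makes the naive estimate biased)
    """
    scores = np.asarray(scores, dtype=float)
    hits = np.asarray(pu_labels)[np.argsort(-scores)]  # sort by descending score
    # recall among labeled positives after targeting the top-i samples
    recall_at_depth = np.cumsum(hits) / hits.sum()
    # area under the cumulative-gains curve = mean recall over all depths
    return recall_at_depth.mean()
```

With a perfect ranking the curve rises to 1 almost immediately, so the value approaches 1; a random ranking gives roughly 0.5.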

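The ProbTagging procedure described in the abstract can be sketched as follows. The abstract only states that each unlabeled sample is tagged with a probability based on its similarity to positive neighbors and that multiple tagged data sets are generated; the similarity measure (mean distance to the k nearest positives, mapped through an exponential) and all function/parameter names below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def prob_tagging_datasets(X_pos, X_unl, n_sets=5, k=5, seed=None):
    """Hypothetical sketch of ProbTagging dataset generation.

    Tags each unlabeled sample positive/negative at random, with a
    probability derived from its similarity to its k nearest positive
    samples, and returns `n_sets` independently tagged data sets
    (one per ensemble member).
    """
    rng = np.random.default_rng(seed)
    # pairwise distances from each unlabeled sample to every positive sample
    dists = np.linalg.norm(X_unl[:, None, :] - X_pos[None, :, :], axis=2)
    # similarity proxy: mean distance to the k nearest positive neighbors
    knn_dist = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    # map distance to a (0, 1] tagging probability (assumed form)
    p_pos = np.exp(-knn_dist)
    p_pos /= p_pos.max()

    datasets = []
    for _ in range(n_sets):
        # tag each unlabeled sample positive with its own probability
        y_unl = (rng.random(len(X_unl)) < p_pos).astype(int)
        X = np.vstack([X_pos, X_unl])
        y = np.concatenate([np.ones(len(X_pos), dtype=int), y_unl])
        datasets.append((X, y))
    return datasets
```

Each returned `(X, y)` pair would then train one base classifier, and the base classifiers are averaged into the final ensemble, matching the "multiple data sets ... combined into an ensemble model" step of the abstract.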