Paper Title

Active Learning for Noisy Data Streams Using Weak and Strong Labelers

Authors

Taraneh Younesian, Dick Epema, Lydia Y. Chen

Abstract

Labeling data correctly is an expensive and challenging task in machine learning, especially for on-line data streams. Deep learning models in particular require large amounts of clean labeled data, which are very difficult to acquire in real-world problems. Choosing useful data samples to label while minimizing the labeling cost is crucial for keeping the training process efficient. When confronted with multiple labelers that differ in expertise and labeling cost, deciding which labeler to query is nontrivial. In this paper, we consider a novel weak and strong labeler problem, inspired by humans' natural ability for labeling, in the presence of data streams with noisy labels and under a limited budget. We propose an on-line active learning algorithm that consists of four steps: filtering, adding diversity, informative sample selection, and labeler selection. We aim to filter out suspicious noisy samples and spend the budget on diverse, informative data, using strong and weak labelers in a cost-effective manner. We derive a decision function that measures the information gain by combining the informativeness of individual samples with model confidence. We evaluate the proposed algorithm on the well-known image classification datasets CIFAR10 and CIFAR100 with up to 60% noise. Experiments show that, by intelligently deciding which labeler to query, our algorithm maintains the same accuracy as when only one of the labelers is available, while spending less of the budget.
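The abstract does not give the decision function itself, so the following is only a minimal sketch of what the last two steps (informative sample selection and labeler selection) might look like under a budget. It assumes predictive entropy as a stand-in for informativeness; the `Labeler` class, the `choose_action` function, the costs, and both thresholds are hypothetical names and values chosen for illustration, not the authors' method. The filtering and diversity steps are omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class Labeler:
    name: str
    cost: float  # hypothetical cost charged per query


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_action(probs, budget, weak, strong,
                  query_threshold=0.5, strong_threshold=1.0):
    """Pick a labeler for one streamed sample, or return None to skip it.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    h = entropy(probs)
    if h < query_threshold:
        # The model is already confident: do not spend budget on this sample.
        return None
    if h >= strong_threshold and budget >= strong.cost:
        # Highly uncertain sample: pay for the more reliable labeler.
        return strong
    if budget >= weak.cost:
        # Moderately uncertain sample, or the strong labeler is unaffordable.
        return weak
    return None  # budget exhausted


weak = Labeler("weak", cost=1.0)
strong = Labeler("strong", cost=5.0)

confident = [0.97, 0.01, 0.01, 0.01]
moderate = [0.7, 0.1, 0.1, 0.1]
uniform = [0.25, 0.25, 0.25, 0.25]

for probs in (confident, moderate, uniform):
    choice = choose_action(probs, budget=10.0, weak=weak, strong=strong)
    print(probs, "->", choice.name if choice else "skip")
```

In this sketch a confident prediction is skipped, a moderately uncertain one is sent to the cheap labeler, and a highly uncertain one is sent to the expensive labeler as long as the remaining budget covers its cost.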
