Paper Title

Eliciting and Learning with Soft Labels from Every Annotator

Authors

Collins, Katherine M., Bhatt, Umang, Weller, Adrian

Abstract

The labels used to train machine learning (ML) models are of paramount importance. Typically for ML classification tasks, datasets contain hard labels, yet learning using soft labels has been shown to yield benefits for model generalization, robustness, and calibration. Earlier work found success in forming soft labels from multiple annotators' hard labels; however, this approach may not converge to the best labels and necessitates many annotators, which can be expensive and inefficient. We focus on efficiently eliciting soft labels from individual annotators. We collect and release a dataset of soft labels (which we call CIFAR-10S) over the CIFAR-10 test set via a crowdsourcing study (N=248). We demonstrate that learning with our labels achieves comparable model performance to prior approaches while requiring far fewer annotators -- albeit with significant temporal costs per elicitation. Our elicitation methodology therefore shows nuanced promise in enabling practitioners to enjoy the benefits of improved model performance and reliability with fewer annotators, and serves as a guide for future dataset curators on the benefits of leveraging richer information, such as categorical uncertainty, from individual annotators.
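The two labeling strategies the abstract contrasts can be sketched concretely: the prior approach aggregates many annotators' hard labels into a vote distribution, whereas soft-label learning trains against a full probability distribution via cross-entropy. The sketch below is illustrative only, assuming NumPy; the function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def soft_label_from_hard_votes(hard_labels, num_classes):
    """Aggregate several annotators' hard labels into one soft label
    by normalized vote counts (the multi-annotator baseline approach)."""
    counts = np.bincount(hard_labels, minlength=num_classes)
    return counts / counts.sum()

def soft_cross_entropy(pred_probs, soft_target, eps=1e-12):
    """Cross-entropy between predicted probabilities and a soft target.
    Reduces to standard cross-entropy when the target is one-hot."""
    return -np.sum(soft_target * np.log(pred_probs + eps))

# Three annotators label one image over 3 classes: two vote class 0, one votes class 1.
votes = np.array([0, 0, 1])
target = soft_label_from_hard_votes(votes, num_classes=3)  # [2/3, 1/3, 0]

# A directly elicited soft label from a single annotator (hypothetical values)
# can express the same categorical uncertainty without needing multiple annotators.
elicited = np.array([0.7, 0.25, 0.05])

pred = np.array([0.6, 0.3, 0.1])  # model's predicted class distribution
loss_voted = soft_cross_entropy(pred, target)
loss_elicited = soft_cross_entropy(pred, elicited)
```

Either soft target plugs into the same loss; the paper's contribution is showing that the single-annotator elicited distribution can match the multi-annotator aggregate in downstream model performance.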
