Paper Title

Exemplar Guided Active Learning

Authors

Jason Hartford, Kevin Leyton-Brown, Hadas Raviv, Dan Padnos, Shahar Lev, Barak Lenz

Abstract

We consider the problem of wisely using a limited budget to label a small subset of a large unlabeled dataset. We are motivated by the NLP problem of word sense disambiguation. For any word, we have a set of candidate labels from a knowledge base, but the label set is not necessarily representative of what occurs in the data: there may exist labels in the knowledge base that very rarely occur in the corpus because the sense is rare in modern English; and conversely there may exist true labels that do not exist in our knowledge base. Our aim is to obtain a classifier that performs as well as possible on examples of each "common class" that occurs with frequency above a given threshold in the unlabeled set while annotating as few examples as possible from "rare classes" whose labels occur with less than this frequency. The challenge is that we are not informed which labels are common and which are rare, and the true label distribution may exhibit extreme skew. We describe an active learning approach that (1) explicitly searches for rare classes by leveraging the contextual embedding spaces provided by modern language models, and (2) incorporates a stopping rule that ignores classes once we prove that they occur below our target threshold with high probability. We prove that our algorithm only costs logarithmically more than a hypothetical approach that knows all true label frequencies and show experimentally that incorporating automated search can significantly reduce the number of samples needed to reach target accuracy levels.
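
The stopping rule in point (2) can be illustrated with a minimal sketch. It assumes a one-sided Hoeffding bound on a class frequency estimated from uniformly sampled, labeled examples; the abstract does not state which concentration bound the authors use, so the bound, the function names, and the default confidence level below are all illustrative assumptions rather than the paper's method.

```python
import math


def hoeffding_upper_bound(count, n, delta):
    """Upper confidence bound on a class frequency (hypothetical helper).

    count: times the class was seen among n uniformly sampled labeled examples.
    delta: allowed failure probability; the bound holds with probability >= 1 - delta.
    """
    if n == 0:
        return 1.0  # nothing observed yet, so nothing can be ruled out
    # One-sided Hoeffding inequality: p <= p_hat + sqrt(ln(1/delta) / (2n)).
    return min(1.0, count / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n)))


def is_provably_rare(count, n, threshold, delta=0.05):
    """True once the class frequency is below `threshold` with high probability."""
    return hoeffding_upper_bound(count, n, delta) < threshold


# A class never seen in 1000 uniform samples can be dropped at a 5% threshold,
# while 100 samples are not yet enough evidence.
print(is_provably_rare(0, 1000, threshold=0.05))
print(is_provably_rare(0, 100, threshold=0.05))
```

Two caveats apply to this sketch. The counts must come from uniform random draws, since examples found by guided search would bias the frequency estimate. The bound is also stated for a single fixed sample size, so checking it repeatedly as labels arrive would need a correction such as a union bound over the checks.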
