Paper Title

A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching

Authors

Venkata Vamsikrishna Meduri, Lucian Popa, Prithviraj Sen, Mohamed Sarwat

Abstract

Entity Matching (EM) is a core data cleaning task that aims to identify different mentions of the same real-world entity. Active learning is one way to address the scarcity of labeled data in practice: it dynamically collects the examples that need labels, has an Oracle label them, and refines the learned model (classifier) on the result. In this paper, we build a unified active learning benchmark framework for EM that allows users to easily combine different learning algorithms with applicable example selection algorithms. The goal of the framework is to enable concrete guidelines for practitioners as to which active learning combinations work well for EM. To this end, we perform comprehensive experiments on publicly available EM datasets from the product and publication domains to evaluate active learning methods, using a variety of metrics including EM quality, number of labels, and example selection latency. Our most surprising finding is that active learning with fewer labels can learn a classifier comparable in quality to one trained with supervised learning. In fact, for several of the datasets, we show that there is an active learning combination that beats the state-of-the-art supervised learning result. Our framework also includes novel optimizations that improve the quality of the learned model by roughly 9% in terms of F1-score and reduce example selection latencies by up to 10x without affecting the quality of the model.
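The loop described in the abstract (select an example, ask the Oracle for its label, refit the classifier) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the paper's framework or API: the names `jaccard`, `ThresholdLearner`, `most_uncertain`, and `active_learn` are hypothetical, the learner is a toy one-feature threshold classifier, and the selector is plain uncertainty sampling. The learner and the selector are passed in separately to mirror the idea of combining a learning algorithm with an example selection algorithm.

```python
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


class ThresholdLearner:
    """Toy learner (assumption): a pair is a match if its similarity >= threshold."""

    def __init__(self) -> None:
        self.threshold = 0.5

    def fit(self, labeled: List[Tuple[float, bool]]) -> None:
        pos = [s for s, is_match in labeled if is_match]
        neg = [s for s, is_match in labeled if not is_match]
        if pos and neg:
            # Put the boundary midway between the lowest-scoring labeled match
            # and the highest-scoring labeled non-match.
            self.threshold = (min(pos) + max(neg)) / 2

    def predict(self, score: float) -> bool:
        return score >= self.threshold


def most_uncertain(learner: ThresholdLearner, pool: List[float]) -> int:
    """Uncertainty sampling: index of the unlabeled score nearest the boundary."""
    return min(range(len(pool)), key=lambda i: abs(pool[i] - learner.threshold))


def active_learn(
    pool: List[float],
    oracle: Callable[[float], bool],
    seed: List[Tuple[float, bool]],
    budget: int,
    learner: ThresholdLearner,
    select: Callable[[ThresholdLearner, List[float]], int] = most_uncertain,
) -> Tuple[ThresholdLearner, int]:
    """Run the select -> label -> refit loop; return the learner and label count."""
    labeled = list(seed)
    remaining = list(pool)
    learner.fit(labeled)
    for _ in range(budget):
        if not remaining:
            break
        score = remaining.pop(select(learner, remaining))
        labeled.append((score, oracle(score)))  # the Oracle supplies the label
        learner.fit(labeled)
    return learner, len(labeled)


if __name__ == "__main__":
    # Hypothetical setup: candidate pairs reduced to one similarity score each,
    # with an Oracle that calls a pair a match at similarity 0.62 or above.
    scores = [i / 20 for i in range(1, 20)]
    model, n_labels = active_learn(
        scores, lambda s: s >= 0.62, [(0.0, False), (1.0, True)], 5, ThresholdLearner()
    )
    print(n_labels, model.threshold)
```

In this toy setup the loop spends its label budget on pairs near the decision boundary, so the threshold settles between 0.60 and 0.65 after five queries instead of requiring labels for all nineteen candidates. A real EM pipeline would use richer similarity features and learners, and swapping in a different `select` function is how another example selection strategy would be compared.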
