Active Deep Learning on Entity Resolution by Risk Sampling

Paper Title

Active Deep Learning on Entity Resolution by Risk Sampling

Authors

Youcef Nafa, Qun Chen, Zhaoqiang Chen, Xingyu Lu, Haiyang He, Tianyi Duan, Zhanhuai Li

Abstract

While the state-of-the-art performance on entity resolution (ER) has been achieved by deep learning, its effectiveness depends on large quantities of accurately labeled training data. To alleviate the data labeling burden, Active Learning (AL) presents itself as a feasible solution that focuses on data deemed useful for model training. Building upon the recent advances in risk analysis for ER, which can provide a more refined estimate on label misprediction risk than the simpler classifier outputs, we propose a novel AL approach of risk sampling for ER. Risk sampling leverages misprediction risk estimation for active instance selection. Based on the core-set characterization for AL, we theoretically derive an optimization model which aims to minimize core-set loss with non-uniform Lipschitz continuity. Since the defined weighted K-medoids problem is NP-hard, we then present an efficient heuristic algorithm. Finally, we empirically verify the efficacy of the proposed approach on real data by a comparative study. Our extensive experiments have shown that it outperforms the existing alternatives by considerable margins. Using ER as a test case, we demonstrate that risk sampling is a promising approach potentially applicable to other challenging classification tasks.
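
The abstract names a weighted K-medoids problem but does not describe the heuristic used to solve it. As a rough illustration only, the sketch below shows one generic greedy heuristic for a weighted K-medoids objective: pick k representative points so that the risk-weighted distance from every point to its nearest representative is small. The function names, the Euclidean distance, the toy feature vectors, and the risk scores are all assumptions made for this example and are not taken from the paper.

```python
import math


def weighted_cost(points, weights, medoids):
    """Sum over all points of weight * distance to the nearest chosen medoid."""
    return sum(
        w * min(math.dist(p, points[m]) for m in medoids)
        for p, w in zip(points, weights)
    )


def greedy_weighted_k_medoids(points, weights, k):
    """Pick k medoid indices greedily.

    At each step, add the candidate whose inclusion gives the lowest
    weighted cost. This is a generic heuristic (an assumption for this
    sketch), not the algorithm proposed in the paper.
    """
    chosen = []
    for _ in range(k):
        best_idx, best_cost = None, math.inf
        for cand in range(len(points)):
            if cand in chosen:
                continue
            cost = weighted_cost(points, weights, chosen + [cand])
            if cost < best_cost:
                best_idx, best_cost = cand, cost
        chosen.append(best_idx)
    return chosen


# Toy data: hypothetical 2-D feature vectors for candidate record pairs,
# with hypothetical misprediction-risk scores used as weights.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
risks = [0.1, 0.1, 0.9, 0.8, 0.2]

selected = greedy_weighted_k_medoids(points, risks, k=2)
print(selected)
```

In this toy setup the first pick lands in the high-risk cluster, because leaving those points far from any medoid is the most expensive choice. Each greedy step re-evaluates the full cost, so the sketch runs in roughly O(k·n²) distance computations and is meant for small examples rather than real ER workloads.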

Paper Title