Paper title
How to reduce the search space of Entity Resolution: with Blocking or Nearest Neighbor search?
Paper authors
Paper abstract
Entity Resolution suffers from quadratic time complexity. To increase its time efficiency, three kinds of filtering techniques are typically used for restricting its search space: (i) blocking workflows, which group together entity profiles with identical or similar signatures, (ii) string similarity join algorithms, which quickly detect entities more similar than a threshold, and (iii) nearest-neighbor methods, which convert every entity profile into a vector and quickly detect the closest entities according to the specified distance function. Numerous methods have been proposed for each type, but the literature lacks a comparative analysis of their relative performance. As we show in this work, this is a non-trivial task, due to the significant impact of configuration parameters on the performance of each filtering technique. We perform the first systematic experimental study that investigates the relative performance of the main methods per type over 10 real-world datasets. For each method, we consider a plethora of parameter configurations, optimizing it with respect to recall and precision. For each dataset, we consider both schema-agnostic and schema-based settings. The experimental results provide novel insights into the effectiveness and time efficiency of the considered techniques, demonstrating the superiority of blocking workflows and string similarity joins.
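To make the three kinds of filtering concrete, the following is a minimal Python sketch run on a toy collection. Everything in it is an assumption made for illustration and is not taken from the paper: the example profiles, the function names (token_blocking, similarity_join, nearest_neighbors, recall_precision), the use of whitespace tokens as signatures, Jaccard similarity for the join, and cosine similarity over binary token vectors for the nearest-neighbor search. The join and the nearest-neighbor search are written as brute-force loops for readability, whereas real implementations rely on indexes to avoid enumerating all pairs.

```python
from collections import defaultdict
from itertools import combinations
import math

# Hypothetical toy profiles (id -> text); not one of the paper's datasets.
profiles = {
    1: "apple iphone 12 black",
    2: "iphone 12 apple black 64gb",
    3: "samsung galaxy s21",
    4: "galaxy s21 samsung phone",
    5: "nokia 3310",
}
true_matches = {(1, 2), (3, 4)}  # assumed ground truth for the toy data

# Schema-agnostic view: every profile becomes a set of whitespace tokens.
tokens = {pid: set(text.split()) for pid, text in profiles.items()}


def token_blocking(tokens):
    """(i) Blocking: profiles sharing a signature (here, a token) form a block."""
    blocks = defaultdict(set)
    for pid, toks in tokens.items():
        for tok in toks:
            blocks[tok].add(pid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    return len(a & b) / len(a | b)


def similarity_join(tokens, threshold):
    """(ii) Similarity join: keep pairs at least as similar as the threshold.
    Brute force for clarity only."""
    return {
        (a, b)
        for a, b in combinations(sorted(tokens), 2)
        if jaccard(tokens[a], tokens[b]) >= threshold
    }


def nearest_neighbors(tokens, k):
    """(iii) Nearest neighbors: vectorize profiles, keep the k closest per profile."""
    vocab = sorted(set().union(*tokens.values()))
    vecs = {pid: [1.0 if t in toks else 0.0 for t in vocab]
            for pid, toks in tokens.items()}

    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0

    pairs = set()
    for a in vecs:
        ranked = sorted((b for b in vecs if b != a),
                        key=lambda b: cosine(vecs[a], vecs[b]), reverse=True)
        for b in ranked[:k]:
            pairs.add((min(a, b), max(a, b)))
    return pairs


def recall_precision(candidates, truth):
    """Recall: share of true matches retained; precision: share of candidates that match."""
    hits = len(candidates & truth)
    return hits / len(truth), (hits / len(candidates) if candidates else 0.0)


if __name__ == "__main__":
    n = len(profiles)
    print("all pairs (quadratic):", n * (n - 1) // 2)
    for name, cands in [
        ("blocking", token_blocking(tokens)),
        ("similarity join", similarity_join(tokens, 0.5)),
        ("nearest neighbors", nearest_neighbors(tokens, 1)),
    ]:
        print(name, sorted(cands), recall_precision(cands, true_matches))
```

Even on this toy input, the configuration parameters (the similarity threshold and k) decide which candidate pairs survive. For example, a nearest-neighbor search with k = 1 still returns one candidate for the profile that has no match at all, which lowers precision without changing recall.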