Supervised machine learning techniques for data matching based on similarity metrics

Paper title

Supervised machine learning techniques for data matching based on similarity metrics

Authors

Pim Verschuuren, Serena Palazzo, Tom Powell, Steve Sutton, Alfred Pilgrim, Michele Faucci Giannelli

Abstract

Businesses, governmental bodies and NGOs have an ever-increasing amount of data at their disposal from which they try to extract valuable information. Often, this needs to be done not only accurately but also within a short time frame. Clean and consistent data is therefore crucial. Data matching is the field that tries to identify instances in data that refer to the same real-world entity. In this study, machine learning techniques are combined with string similarity functions and applied to the field of data matching. A dataset of invoices from a variety of businesses and organizations was preprocessed with a grouping scheme to reduce the number of candidate pairs, and a set of similarity functions was used to quantify the similarity between the invoices in each pair. The resulting invoice pair dataset was then used to train and validate a neural network and a boosted decision tree. Their performance was compared with a solution from FISCAL Technologies, which served as a benchmark for currently available deduplication solutions. Both the neural network and the boosted decision tree performed as well as or better than the benchmark.
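
The preprocessing described in the abstract (grouping to limit the number of pairs, then similarity functions per pair) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the invoice field names, the choice of the amount as the grouping key, and the use of Python's `difflib.SequenceMatcher` ratio as the string similarity function. Training the neural network or boosted decision tree on the resulting features is left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical invoice records; the field names are illustrative assumptions.
invoices = [
    {"id": 1, "vendor": "Acme Ltd", "number": "INV-1001", "amount": 250.00},
    {"id": 2, "vendor": "ACME Limited", "number": "INV1001", "amount": 250.00},
    {"id": 3, "vendor": "Acme Ltd", "number": "INV-2040", "amount": 980.50},
    {"id": 4, "vendor": "Borealis GmbH", "number": "B-77", "amount": 250.00},
]


def string_similarity(a, b):
    """Case-insensitive similarity in [0, 1] (stand-in for the paper's metrics)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def block_key(invoice):
    """Assumed grouping key: only invoices with the same amount are compared."""
    return round(invoice["amount"], 2)


def candidate_pairs(records):
    """Yield pairs of records that share a grouping key."""
    groups = defaultdict(list)
    for record in records:
        groups[block_key(record)].append(record)
    for group in groups.values():
        yield from combinations(group, 2)


def pair_features(a, b):
    """Turn one invoice pair into similarity features for a classifier."""
    return {
        "pair": (a["id"], b["id"]),
        "vendor_sim": string_similarity(a["vendor"], b["vendor"]),
        "number_sim": string_similarity(a["number"], b["number"]),
    }


features = [pair_features(a, b) for a, b in candidate_pairs(invoices)]
for row in features:
    print(row)
```

With four invoices, comparing every record against every other would give six pairs; the assumed grouping key leaves three. Each feature row would then be labeled as duplicate or non-duplicate and passed to a supervised classifier.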
