论文标题
SFTM:使用基于相似性的灵活树匹配的Web文档快速比较
SFTM: Fast Comparison of Web Documents using Similarity-based Flexible Tree Matching
论文作者
论文摘要
已经在许多领域(包括网络数据挖掘和提取)进行了研究,作为分析网络文档内容,现有的树匹配方法(例如Tree-Edit距离(TED)或灵活的树匹配(FTM))的关键组成部分,已经研究了树匹配技术,无法扩展到几百个节点以上,该节点远低于现有在线文档的平均复杂性和应用程序和应用程序的平均复杂性。因此,在本文中,我们提出了一种基于相似性的新型柔性树匹配算法(SFTM),该算法是第一个启用具有实用计算时间的现实生活网络文档上的树匹配的算法。特别是,我们将树匹配作为优化问题,并利用节点标签和局部拓扑相似性,以避免任何组合爆炸。我们的实际评估表明,我们的方法与TED的参考实现相比,同时将计算时间提高了两个数量级。
Tree matching techniques have been investigated in many fields, including web data mining and extraction, as a key component to analyze the content of web documents, existing tree matching approaches, like Tree-Edit Distance (TED) or Flexible Tree Matching (FTM), fail to scale beyond a few hundreds of nodes, which is far below the average complexity of existing web online documents and applications. In this paper, we therefore propose a novel Similarity-based Flexible Tree Matching algorithm (SFTM), which is the first algorithm to enable tree matching on real-life web documents with practical computation times. In particular, we approach tree matching as an optimisation problem and we leverage node labels and local topology similarity in order to avoid any combinatorial explosion. Our practical evaluation demonstrates that our approach compares to the reference implementation of TED qualitatively, while improving the computation times by two orders of magnitude.