实时无索引单源Simrank在Web尺度图上处理

论文标题

实时无索引单源Simrank在Web尺度图上处理

Realtime Index-Free Single Source SimRank Processing on Web-Scale Graphs

论文作者

Shi, Jieming, Jin, Tianyuan, Yang, Renchi, Xiao, Xiaokui, Yang, Yin

论文摘要

给定G中G和A节点U中的G中，单个源Simrank查询评估了G中的u和每个节点V之间的相似性。单个源simrank计算的现有方法会产生长时间的查询响应时间或昂贵的预计算时间，或者需要在图形g更改时需要再次执行。因此，据我们所知，它们都不是（i）（i）查询处理的场景理想的理想之选，并且（ii）基础图G是巨大的，并且经常更新。在此激励的情况下，我们提出了Simpush，这是一种新型算法，该算法回答单一源simrank查询而没有任何预发行，同时也达到了比最快的基于索引的解决方案的查询处理速度明显更高。此外，Simpush提供了严格的结果质量保证，其高性能不依赖于基础图的任何强有力的假设。具体而言，与现有方法相比，Simpush采用了根本不同的算法设计，重点是（i）识别与查询相关的少数节点，然后（ii）计算统计信息和仅从这些节点中执行残留物。我们证明了Simpush的正确性，分析其时间复杂性，并将其渐近性能与现有方法进行比较。同时，我们通过在8个真实数据集上进行了大量实验来评估Simpush的实际性能。结果表明，Simpush始终超出所有现有解决方案，通常会超过一个数量级。特别是，在商品机器上，Simpush在Web图上回答了单个源Simrank查询，该网络图中包含超过1.33亿个节点和54亿个边缘，在62毫秒不足的情况下，有0.00035的经验错误，而基于最快的索引竞争对手则需要1.18秒。

Given a graph G and a node u in G, a single source SimRank query evaluates the similarity between u and every node v in G. Existing approaches to single source SimRank computation incur either long query response time, or expensive pre-computation, which needs to be performed again whenever the graph G changes. Consequently, to our knowledge none of them is ideal for scenarios in which (i) query processing must be done in realtime, and (ii) the underlying graph G is massive, with frequent updates. Motivated by this, we propose SimPush, a novel algorithm that answers single source SimRank queries without any pre-computation, and at the same time achieves significantly higher query processing speed than even the fastest known index-based solutions. Further, SimPush provides rigorous result quality guarantees, and its high performance does not rely on any strong assumption of the underlying graph. Specifically, compared to existing methods, SimPush employs a radically different algorithmic design that focuses on (i) identifying a small number of nodes relevant to the query, and subsequently (ii) computing statistics and performing residue push from these nodes only. We prove the correctness of SimPush, analyze its time complexity, and compare its asymptotic performance with that of existing methods. Meanwhile, we evaluate the practical performance of SimPush through extensive experiments on 8 real datasets. The results demonstrate that SimPush consistently outperforms all existing solutions, often by over an order of magnitude. In particular, on a commodity machine, SimPush answers a single source SimRank query on a web graph containing over 133 million nodes and 5.4 billion edges in under 62 milliseconds, with 0.00035 empirical error, while the fastest index-based competitor needs 1.18 seconds.

下载PDF全文

下载文献需遵守相关版权规定

论文标题