论文标题
Saphyra:一种学习理论方法,用于在大型网络中排名节点
SaPHyRa: A Learning Theory Approach to Ranking Nodes in Large Networks
论文作者
论文摘要
在大规模网络中,基于其中心性的节点是一个基本的,但具有挑战性的问题。近似方法可以快速估计节点的中心性并确定最中心的节点,但是大多数剩余节点的排名可能毫无意义。例如,众所周知,搜索查询中鲜为人知的网站的排名是嘈杂且不稳定的。为此,我们研究了一个新的节点排名问题,具有两个重要区别:a)排名质量,而不是中心性估计质量,作为主要目标; b)仅排名感兴趣的节点,例如符合搜索标准的网站。我们提出了样本空间分配的假设排名或Saphyra,将节点排名转化为机器学习中的假设排名。这种转换将节点的核心性映射到假设的预期风险,为理论机器学习(ML)工具打开门。 Saphyra的关键是将样品空间分为精确和近似子空间。确切的子空间包含与感兴趣的节点相关的样本,从而提高了估计和排名质量。可以用基于ML的技术有效地对近似空间进行采样,以在估计误差上提供理论保证。最后,我们提出了Saphyra_BC,这是Saphyra关于排名节点中心(BC)的插图。通过组合新型的双组分采样,2-HOP样本分区以及改善Vapnik-Chervonenkis维度的界限,Saphyra_BC可以有效地对BC中的任何节点子集进行排名。在近似BC中,其性能比最先进的方法快200倍,而其与地面真相的等级相关性得到了多重增长。
Ranking nodes based on their centrality stands a fundamental, yet, challenging problem in large-scale networks. Approximate methods can quickly estimate nodes' centrality and identify the most central nodes, but the ranking for the majority of remaining nodes may be meaningless. For example, ranking for less-known websites in search queries is known to be noisy and unstable. To this end, we investigate a new node ranking problem with two important distinctions: a) ranking quality, rather than the centrality estimation quality, as the primary objective; and b) ranking only nodes of interest, e.g., websites that matched search criteria. We propose Sample space Partitioning Hypothesis Ranking, or SaPHyRa, that transforms node ranking into a hypothesis ranking in machine learning. This transformation maps nodes' centrality to the expected risks of hypotheses, opening doors for theoretical machine learning (ML) tools. The key of SaPHyRa is to partition the sample space into exact and approximate subspaces. The exact subspace contains samples related to the nodes of interest, increasing both estimation and ranking qualities. The approximate space can be efficiently sampled with ML-based techniques to provide theoretical guarantees on the estimation error. Lastly, we present SaPHyRa_bc, an illustration of SaPHyRa on ranking nodes' betweenness centrality (BC). By combining a novel bi-component sampling, a 2-hop sample partitioning, and improved bounds on the Vapnik-Chervonenkis dimension, SaPHyRa_bc can effectively rank any node subset in BC. Its performance is up to 200x faster than state-of-the-art methods in approximating BC, while its rank correlation to the ground truth is improved by multifold.