论文标题

在随机块模型中恢复不平衡的社区,并应用有故障的Oracle聚类

Recovering Unbalanced Communities in the Stochastic Block Model With Application to Clustering with a Faulty Oracle

论文作者

Mukherjee, Chandra Sekhar, Peng, Pan, Zhang, Jiapeng

论文摘要

随机块模型(SBM)是研究网络中图形聚类或社区检测的基本模型。在过去的十年中,它受到了极大的关注,并且平衡的案例,即假设所有簇都有大尺寸,都已得到很好的研究。但是,我们对不平衡社区的SBM的理解(可以说,在实践中更相关)仍然是有限的。在本文中,我们提供了一种简单的基于SVD的算法,用于恢复具有不同大小的社区的SBM社区。我们根据Ailon,Chen和Xu的结果进行改进[ICML 2013; JMLR 2015]通过删除假设存在很大的间隔,以使簇的大小不会落入其中,并且还消除了可回收簇对基础簇数量的依赖性。我们通过实验比较进一步补充了我们的理论改进。在种植的集团猜想下,当概率参数恒定时,可以通过我们的算法恢复的集群的大小几乎是最佳的(最多可达多层因子)。 作为副产品,我们在有缺陷的甲骨文模型中获得具有sublinear查询复杂性的有效聚类算法,该模型能够检测大于$ \tildeΩ({\ sqrt {n}}} $的所有簇,即使在$ω(n)$小群众的情况下,即使在图中也有图形。相比之下,如果使用超过$ \ \tildeΩ(n^{2/5})$小型群集,则使用均值数量的查询的先前有效算法无法恢复任何大型群集。

The stochastic block model (SBM) is a fundamental model for studying graph clustering or community detection in networks. It has received great attention in the last decade and the balanced case, i.e., assuming all clusters have large size, has been well studied. However, our understanding of SBM with unbalanced communities (arguably, more relevant in practice) is still limited. In this paper, we provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes. We improve upon a result of Ailon, Chen and Xu [ICML 2013; JMLR 2015] by removing the assumption that there is a large interval such that the sizes of clusters do not fall in, and also remove the dependency of the size of the recoverable clusters on the number of underlying clusters. We further complement our theoretical improvements with experimental comparisons. Under the planted clique conjecture, the size of the clusters that can be recovered by our algorithm is nearly optimal (up to poly-logarithmic factors) when the probability parameters are constant. As a byproduct, we obtain an efficient clustering algorithm with sublinear query complexity in a faulty oracle model, which is capable of detecting all clusters larger than $\tildeΩ({\sqrt{n}})$, even in the presence of $Ω(n)$ small clusters in the graph. In contrast, previous efficient algorithms that use a sublinear number of queries are incapable of recovering any large clusters if there are more than $\tildeΩ(n^{2/5})$ small clusters.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源