论文标题
基于流动的算法用于改进集群:统一的框架,软件和性能
Flow-based Algorithms for Improving Clusters: A Unifying Framework, Software, and Performance
论文作者
论文摘要
矢量空间中的聚类点或图中的节点是统计数据分析中普遍存在的原始性,并且通常用于探索性数据分析。在实践中,“完善”或“改善”给定群集通常是其他方法获得的,通常是有意义的。在这项调查中,我们专注于此集群改进问题的原则性算法。许多这样的群集改进算法是基于流的方法,我们的意思是,在操作上,它们需要在A(通常隐含)修改的数据图上解决最大流问题序列的解决方案。这些群集改进算法在理论上还是在实践上都是强大的,但是在社区发现,当地图,半监督学习等问题等问题上,它们并未被广泛采用。可能的原因是:这些算法的陡峭学习曲线;缺乏高效且易于使用的软件;以及缺乏对现实数据证明其有用性的详细数值实验。我们的目标是解决这些问题。为此,我们指导读者了解如何实施和应用这些强大的算法的整个过程。我们提出了一个统一的分数编程优化框架,该框架使我们可以简单地提炼所有这些算法的关键组成部分。它还使相关方法之间的明显相似性和差异。通过分数编程框架查看这些集群改进算法为未来算法开发提供了方向。最后,我们在LocalGraphClustering Python软件包中开发了这些算法的有效实现,并执行了广泛的数值实验,以证明这些方法在社交网络和基于图像的数据图上的性能。
Clustering points in a vector space or nodes in a graph is a ubiquitous primitive in statistical data analysis, and it is commonly used for exploratory data analysis. In practice, it is often of interest to "refine" or "improve" a given cluster that has been obtained by some other method. In this survey, we focus on principled algorithms for this cluster improvement problem. Many such cluster improvement algorithms are flow-based methods, by which we mean that operationally they require the solution of a sequence of maximum flow problems on a (typically implicitly) modified data graph. These cluster improvement algorithms are powerful, both in theory and in practice, but they have not been widely adopted for problems such as community detection, local graph clustering, semi-supervised learning, etc. Possible reasons for this are: the steep learning curve for these algorithms; the lack of efficient and easy to use software; and the lack of detailed numerical experiments on real-world data that demonstrate their usefulness. Our objective here is to address these issues. To do so, we guide the reader through the whole process of understanding how to implement and apply these powerful algorithms. We present a unifying fractional programming optimization framework that permits us to distill, in a simple way, the crucial components of all these algorithms. It also makes apparent similarities and differences between related methods. Viewing these cluster improvement algorithms via a fractional programming framework suggests directions for future algorithm development. Finally, we develop efficient implementations of these algorithms in our LocalGraphClustering Python package, and we perform extensive numerical experiments to demonstrate the performance of these methods on social networks and image-based data graphs.