概率的分配分区（PPP）

论文标题

概率的分配分区（PPP）

Probabilistic Partitive Partitioning (PPP)

论文作者

Sultan, Mujahid

论文摘要

聚类是一个NP硬性问题。因此，不存在最佳算法，可以应用启发式方法来聚集数据。如果无法正确应用，启发式方法可能是非常密集的。对于基本大的数据集，如果可以实现最小的信息丢失，则可以通过减少输入空间来实现计算效率。通常，聚类算法遇到两个常见问题：1）这些问题会收敛到具有不同初始条件的不同设置； 2）必须事先任意决定集群的数量。这个问题在大数据领域变得至关重要。最近，出现了聚类算法，可以使用网格上的并行处理加速计算，但面临上述问题。目标：我们的目标是找到聚集数据的方法：1）确保与初始条件无关的是相同设置的收敛； 2）消除事先建立簇数的需求，3）可以应用于群集大数据集。方法：我们引入了一种结合概率和组合聚类方法的方法，以产生对初始条件不敏感的可重复和紧凑簇。该方法利用K-均值（一种组合聚类方法）的功率来簇/分区非常大的维数数据集，并使用高斯混合模型（一种概率聚类方法）来验证K-均值分区。结果：我们表明，此方法会产生非常紧凑的簇，这些簇对初始条件不敏感。该方法可用于识别数据集中最“可分离的”集，该数据集增加了数据集的“凝聚力”。此方法还消除了提前指定簇数的需求。

Clustering is a NP-hard problem. Thus, no optimal algorithm exists, heuristics are applied to cluster the data. Heuristics can be very resource-intensive, if not applied properly. For substantially large data sets computational efficiencies can be achieved by reducing the input space if a minimal loss of information can be achieved. Clustering algorithms, in general, face two common problems: 1) these converge to different settings with different initial conditions and; 2) the number of clusters has to be arbitrarily decided beforehand. This problem has become critical in the realm of big data. Recently, clustering algorithms have emerged which can speedup computations using parallel processing over the grid but face the aforementioned problems. Goals: Our goals are to find methods to cluster data which: 1) guarantee convergence to the same settings irrespective of the initial conditions; 2) eliminate the need to establish the number of clusters beforehand, and 3) can be applied to cluster large datasets. Methods: We introduce a method that combines probabilistic and combinatorial clustering methods to produce repeatable and compact clusters that are not sensitive to initial conditions. This method harnesses the power of k-means (a combinatorial clustering method) to cluster/partition very large dimensional datasets and uses the Gaussian Mixture Model (a probabilistic clustering method) to validate the k-means partitions. Results: We show that this method produces very compact clusters that are not sensitive to initial conditions. This method can be used to identify the most 'separable' set in a dataset which increases the 'clusterability' of a dataset. This method also eliminates the need to specify the number of clusters in advance.

下载PDF全文

下载文献需遵守相关版权规定

论文标题