Paper title
Distributed Bayesian clustering using finite mixture of mixtures
Paper authors
Paper abstract
In many modern applications, there is interest in analyzing enormous data sets that cannot easily be moved across computers or loaded into memory on a single computer. In such settings, clustering is a very common goal. Existing distributed clustering algorithms are mostly distance- or density-based, without a likelihood specification, precluding formal statistical inference. Model-based clustering allows statistical inference, yet research on distributed inference has emphasized nonparametric Bayesian mixture models over finite mixture models. To fill this gap, we introduce a nearly embarrassingly parallel algorithm for clustering under a Bayesian overfitted finite mixture of Gaussian mixtures, which we term distributed Bayesian clustering (DIB-C). DIB-C can flexibly accommodate data sets with various shapes (e.g., skewed or multimodal). With data randomly partitioned and distributed, we first run Markov chain Monte Carlo in an embarrassingly parallel manner to obtain local clustering draws, and then refine across workers for a final clustering estimate based on any loss function on the space of partitions. DIB-C can also estimate cluster densities, quickly classify new subjects, and provide a posterior predictive distribution. Both simulation studies and real data applications show superior performance of DIB-C in terms of robustness and computational efficiency.
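To make the divide-and-conquer idea concrete, the sketch below mimics the first stage of the pipeline the abstract describes: data are randomly partitioned across workers, each worker fits a local mixture independently, and local cluster labels are then aligned to a common ordering. This is only an illustrative toy, not the authors' DIB-C algorithm: `local_em_1d` substitutes a short 1-D Gaussian-mixture EM run for the paper's MCMC sampler, the label alignment by sorted component means stands in for the loss-based refinement across workers, and all function names are ours.

```python
import numpy as np

def local_em_1d(x, k=2, iters=50):
    """Toy 1-D Gaussian-mixture EM on one worker's shard.

    A cheap stand-in for the per-worker MCMC the abstract describes;
    returns hard cluster labels and the fitted component means.
    """
    # Deterministic initialization at spread-out quantiles of the shard.
    mu = np.quantile(x, np.linspace(0.25, 0.75, k))
    sigma = np.full(k, x.std() + 1e-6)
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
            / (sigma * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and standard deviations.
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return r.argmax(axis=1), mu

# Two well-separated clusters, randomly partitioned across 4 "workers".
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-5, 1, 200), rng.normal(5, 1, 200)])
shards = np.array_split(rng.permutation(len(x)), 4)

labels = np.empty(len(x), dtype=int)
for idx in shards:
    z, mu = local_em_1d(x[idx])
    # Crude cross-worker alignment: relabel so component 0 always has
    # the smallest mean (DIB-C instead minimizes a partition-space loss).
    labels[idx] = np.argsort(np.argsort(mu))[z]
```

Because each shard is processed independently, the loop over `shards` could run on separate machines with no communication until the final alignment step, which is what makes the scheme "nearly embarrassingly parallel."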