Paper title

VARCLUST: clustering variables using dimensionality reduction

Authors

Piotr Sobczyk, Stanislaw Wilczynski, Malgorzata Bogdan, Piotr Graczyk, Julie Josse, Fabien Panloup, Valérie Seegers, Mateusz Staniak

Abstract

The VARCLUST algorithm is proposed for clustering variables under the assumption that the variables in a given cluster are linear combinations of a small number of hidden latent variables, corrupted by random noise. The entire clustering task is viewed as a statistical model selection problem, where the model is defined by the number of clusters, the partition of variables into these clusters, and the 'cluster dimensions', i.e. the vector of dimensions of the linear subspaces spanning each of the clusters. The optimal model is selected using an approximate Bayesian criterion based on Laplace approximations, with a non-informative uniform prior on the number of clusters. To search the huge space of possible models, we propose an extension of the ClustOfVar algorithm, which was dedicated to subspaces of dimension 1 only and which is similar in structure to the $K$-centroid algorithm. We provide a complete methodology with theoretical guarantees, extensive numerical experiments, complete data analyses and an implementation. Our algorithm assigns variables to the appropriate clusters based on the consistent Bayesian Information Criterion (BIC), and estimates the dimensionality of each cluster by the PEnalized SEmi-integrated Likelihood Criterion (PESEL), whose consistency we prove. Additionally, we prove that each iteration of our algorithm leads to an increase of the Laplace approximation to the model posterior probability, and we provide a criterion for estimating the number of clusters. Numerical comparisons with other algorithms show that VARCLUST may outperform some popular machine learning tools for sparse subspace clustering. We also report the results of real data analyses, including TCGA breast cancer data and meteorological data. The proposed method is implemented in the publicly available R package varclust.
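
The alternating structure described in the abstract can be illustrated with a minimal Python sketch. It is not the varclust implementation and does not use its API: the function names (`fit_bases`, `residual_ss`, `cluster_variables`) are hypothetical, and the sketch assumes one fixed dimension `dim` for every cluster instead of estimating each dimension with PESEL. Under that assumption, and taking the usual Gaussian regression form of BIC, every cluster carries the same penalty, so choosing the cluster with the best BIC for a variable reduces to choosing the cluster whose principal components leave the smallest residual sum of squares.

```python
import numpy as np


def fit_bases(X, labels, k, dim):
    """For each cluster, return an orthonormal basis of its top principal directions."""
    bases = []
    for c in range(k):
        block = X[:, labels == c]
        if block.shape[1] == 0:
            # Empty cluster: nothing to estimate, so use a basis with zero columns.
            bases.append(np.zeros((X.shape[0], 0)))
            continue
        U, _, _ = np.linalg.svd(block, full_matrices=False)
        bases.append(U[:, :dim])  # at most `dim` left singular vectors
    return bases


def residual_ss(X, bases):
    """Residual sum of squares of every variable after projection onto every basis.

    Returns an array of shape (k, p): row c holds the RSS of each column of X
    with respect to the subspace of cluster c.
    """
    return np.stack([((X - B @ (B.T @ X)) ** 2).sum(axis=0) for B in bases])


def cluster_variables(X, k, dim, n_iter=20, seed=0):
    """K-centroid-style clustering of the columns (variables) of X.

    Simplification (an assumption of this sketch): all clusters share the same
    fixed dimension `dim`, so the assignment step uses the RSS directly.
    """
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)                       # centre each variable
    labels = rng.integers(k, size=X.shape[1])    # random initial partition
    history = []                                 # total RSS after each assignment step
    for _ in range(n_iter):
        bases = fit_bases(X, labels, k, dim)     # "centroid" step: PCA per cluster
        rss = residual_ss(X, bases)
        new_labels = rss.argmin(axis=0)          # assignment step
        history.append(rss.min(axis=0).sum())
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, history
```

In this simplified setting the total residual sum of squares recorded in `history` cannot increase from one iteration to the next, which loosely mirrors the monotonicity property the abstract states for the Laplace approximation. Like any K-centroid-style procedure, the sketch can stop in a local optimum, so several random starts (different `seed` values) would normally be compared.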
