Asymptotics for the $k$-means

Paper title

Asymptotics for The $k$-means

Author

Zhang, Tonglin

Abstract

$k$-means is one of the most important unsupervised learning techniques in statistics and computer science. The goal is to partition a data set into clusters such that observations within a cluster are as homogeneous as possible and observations in different clusters are as heterogeneous as possible. Although the method is well known, the investigation of its asymptotic properties lags far behind, which makes it difficult to develop more precise $k$-means methods in practice. To address this issue, a new concept called clustering consistency is proposed. Fundamentally, the proposed clustering consistency is more appropriate for clustering methods than the previously used criterion consistency. Using this concept, a new $k$-means method is proposed. The proposed method is found to have lower clustering error rates and to be more robust to small clusters and outliers than existing $k$-means methods. When $k$ is unknown, the proposed method can also identify the number of clusters using the Gap statistic. This is rarely achieved by the existing $k$-means methods adopted by many software packages.
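
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of the classical Lloyd iteration for $k$-means, which minimizes the within-cluster sum of squares. It is not the method proposed in the paper, and it does not implement clustering consistency or the Gap statistic. The function names `sq_dist` and `kmeans`, the parameters `n_iter` and `seed`, and the toy data are assumptions made for illustration only.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Classical Lloyd iteration (illustrative sketch, not the paper's method).

    Returns (labels, centers, wcss), where wcss is the within-cluster
    sum of squared distances to the assigned centers.
    """
    rng = random.Random(seed)
    # Initialize the centers with k distinct observations chosen at random.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the iteration has converged
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wcss


# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers, wcss = kmeans(points, k=2)
print(labels, wcss)
```

On data like this the iteration separates the two groups, but the result depends on the random initialization in general, and a single small cluster or outlier can pull a center away from the bulk of the data. This sensitivity is the kind of weakness the abstract says the proposed method is more robust to.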
