Paper Title

An Efficient $k$-modes Algorithm for Clustering Categorical Datasets

Authors

Karin S. Dorman, Ranjan Maitra

Abstract

Mining clusters from data is an important endeavor in many applications. The $k$-means method is a popular, efficient, and distribution-free approach for clustering numerical-valued data, but does not apply for categorical-valued observations. The $k$-modes method addresses this lacuna by replacing the Euclidean with the Hamming distance and the means with the modes in the $k$-means objective function. We provide a novel, computationally efficient implementation of $k$-modes, called OTQT. We prove that OTQT finds updates to improve the objective function that are undetectable to existing $k$-modes algorithms. Although slightly slower per iteration due to algorithmic complexity, OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum. Thus, we recommend OTQT as the preferred, default algorithm for $k$-modes optimization.
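To make the $k$-modes objective concrete, the following is a minimal sketch of a plain alternating (Lloyd-style) $k$-modes iteration using the Hamming distance and coordinate-wise modes. It is not OTQT, whose update rule is not described in the abstract; the function names `hamming`, `column_modes`, and `kmodes_lloyd`, the tie-breaking behavior, and the empty-cluster handling are all assumptions made for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of coordinates where two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Coordinate-wise mode of a non-empty list of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def kmodes_lloyd(data, modes, max_iter=100):
    """Baseline k-modes: alternate assignments and mode updates.

    Illustrative only (assumed design, not the paper's OTQT algorithm).
    Returns (labels, modes, cost), where cost is the total Hamming
    distance from each record to the mode of its assigned cluster.
    """
    modes = [tuple(m) for m in modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode in Hamming distance.
        new_labels = [
            min(range(len(modes)), key=lambda k: hamming(x, modes[k]))
            for x in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes are too
        labels = new_labels
        # Update step: replace each mode by its cluster's coordinate-wise mode.
        for k in range(len(modes)):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:  # assumption: keep the old mode if a cluster is empty
                modes[k] = column_modes(members)
    cost = sum(hamming(x, modes[lab]) for x, lab in zip(data, labels))
    return labels, modes, cost


if __name__ == "__main__":
    data = [
        ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
        ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
    ]
    labels, modes, cost = kmodes_lloyd(data, [data[0], data[3]])
    print(labels, modes, cost)
```

The cost returned here is the $k$-means objective with the Euclidean distance replaced by the Hamming distance and the means by modes, which is the quantity the abstract says OTQT improves.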
