Paper title
K-means for Evolving Data Streams
Paper authors
Paper abstract
The amount of data produced worldwide is growing beyond measure, so a high volume of unsupervised data must be processed continuously. Clustering is one of the main unsupervised data analysis tasks. In streaming data scenarios, the data consists of a growing sequence of batches of samples in which concept drift may occur. In this paper, we formally define the Streaming $K$-means (S$K$M) problem, which implies a restart of the error function when a concept drift occurs. We propose a surrogate error function that does not rely on concept drift detection. We prove that the surrogate is a good approximation of the S$K$M error. Hence, we suggest an algorithm that minimizes this alternative error each time a new batch arrives. We also present some initialization techniques for streaming data scenarios. Besides providing theoretical results, experiments demonstrate an improvement of the converged error for the non-trivial initialization methods.
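The abstract does not state the form of the surrogate error. As a minimal sketch, assume it is a $K$-means error in which each batch is down-weighted by a forgetting factor `rho` raised to the batch's age, so that old batches fade without any explicit drift detection. That weighting, the parameter `rho`, and the function names `weighted_kmeans_error` and `weighted_lloyd` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def weighted_kmeans_error(batches, centroids, rho):
    """Assumed surrogate: K-means error with batch weight rho**age.

    `batches` is a list of (n_i, d) arrays, oldest first; the newest
    batch has age 0 and therefore weight 1. Use 0 < rho <= 1.
    """
    centroids = np.asarray(centroids, dtype=float)
    newest = len(batches) - 1
    total = 0.0
    for t, batch in enumerate(batches):
        # Squared distance from every sample to every centroid.
        d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        total += rho ** (newest - t) * d2.min(axis=1).sum()
    return total

def weighted_lloyd(batches, centroids, rho, n_iter=20):
    """Lloyd-style iterations on the assumed weighted error."""
    points = np.vstack(batches)
    newest = len(batches) - 1
    weights = np.concatenate(
        [np.full(len(b), rho ** (newest - t)) for t, b in enumerate(batches)]
    )
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(len(centroids)):
            mask = labels == k
            if mask.any():  # leave an empty cluster's centroid unchanged
                centroids[k] = np.average(
                    points[mask], axis=0, weights=weights[mask]
                )
    return centroids
```

Under this assumption, each time a new batch arrives it would be appended to `batches` and `weighted_lloyd` rerun from the previous centroids (or from a fresh initialization). A small `rho` forgets old concepts quickly, while `rho = 1` recovers ordinary $K$-means over all data seen so far.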