论文标题
迈向K-Median和K-均值核心的最佳下限
Towards Optimal Lower Bounds for k-median and k-means Coresets
论文作者
论文摘要
鉴于公制空间中的一组积分,$(k,z)$ - 聚类问题包括找到一组称为中心的$ k $点,以便最小化每个数据点的$ z $ $ z $的距离之和最小化。特殊案例包括著名的K-Median问题($ z = 1 $)和k-means问题($ z = 2 $)。 The $k$-median and $k$-means problems are at the heart of modern data analysis and massive data applications have given raise to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative $(1 \pm \varepsilon)$ factor, hence reducing from large to small scale the input to the problem. 在本文中,我们提出了各种度量空间中核心的改进。在有限指标中由$ n $点组成和两倍的指标,加倍$ d $,我们表明,$(k,z)$ croustering的任何核心都必须由至少$ω组成(k \ varepsilon^{ - 2} \ log n)$ and $ω(k \ varepsilon^v varepsilon^varepsilon^{-22} d)d,这两个界限都与先前的上限与多径因子相匹配。在欧几里得空间中,我们证明了$(k,z)$ clustering的任何核心必须由至少$ω(k \ varepsilon^{ - 2})$点组成。我们将这些下限与一个核心结构进行补充,该核心结构最多由$ \ tilde {o}(k \ varepsilon^{ - 2} \ cdot \ min(\ varepsilon^{ - z},k),k))$ points $。
Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers, such that the sum of distances raised to the power of $z$ of every data point to its closest center is minimized. Special cases include the famous k-median problem ($z = 1$) and k-means problem ($z = 2$). The $k$-median and $k$-means problems are at the heart of modern data analysis and massive data applications have given raise to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative $(1 \pm \varepsilon)$ factor, hence reducing from large to small scale the input to the problem. In this paper, we present improved lower bounds for coresets in various metric spaces. In finite metrics consisting of $n$ points and doubling metrics with doubling constant $D$, we show that any coreset for $(k,z)$ clustering must consist of at least $Ω(k \varepsilon^{-2} \log n)$ and $Ω(k \varepsilon^{-2} D)$ points, respectively. Both bounds match previous upper bounds up to polylog factors. In Euclidean spaces, we show that any coreset for $(k,z)$ clustering must consists of at least $Ω(k\varepsilon^{-2})$ points. We complement these lower bounds with a coreset construction consisting of at most $\tilde{O}(k\varepsilon^{-2}\cdot \min(\varepsilon^{-z},k))$ points.