在公制和欧几里得空间中的公平聚类及其应用

论文标题

在公制和欧几里得空间中的公平聚类及其应用

On Coresets for Fair Clustering in Metric and Euclidean Spaces and Their Applications

论文作者

Bandyapadhyay, Sayan, Fomin, Fedor V., Simonov, Kirill

论文摘要

公平聚类是聚类的约束变体，目标是分区一组有色点，因此每个群集中任何颜色的点的比例或多或少等于数据集中该颜色的点的比例。该变体最近由Chierichetti等人引入。 [Neurips，2017]在一项开创性的工作中，在聚类文献中广受欢迎。在本文中，我们提出了一个新的核心组结构，以基于随机抽样进行公平聚类。新的结构使我们能够在一般度量空间中获得第一个用于公平聚类的核心。对于欧几里得空间，我们获得了第一个核心，其大小不取决于尺寸。我们的核心结果解决了Schmidt等人提出的开放问题。 [WAOA，2019年]和Huang等。 [Neurips，2019年]。新的核心结构有助于设计几种新的近似和流算法。特别是，我们获得了用于度量公平聚类的第一个真实常数算法，其运行时间是固定参数可拖动的（FPT）。在欧几里得的情况下，我们得出了第一个$（1+ε）$ - 近似算法的公平聚类算法，其时间复杂性接近线性，并且不呈指数级取决于空间的尺寸。此外，我们的核心构建方案相当一般，并引起了各种约束聚类问题的核心。这会导致在欧几里得公制中的一般指标和接近线性的时间$（1+ε）$ - 近似值中的这些问题和接近线性的时间（1+ε）$中的恒定量。

Fair clustering is a constrained variant of clustering where the goal is to partition a set of colored points, such that the fraction of points of any color in every cluster is more or less equal to the fraction of points of this color in the dataset. This variant was recently introduced by Chierichetti et al. [NeurIPS, 2017] in a seminal work and became widely popular in the clustering literature. In this paper, we propose a new construction of coresets for fair clustering based on random sampling. The new construction allows us to obtain the first coreset for fair clustering in general metric spaces. For Euclidean spaces, we obtain the first coreset whose size does not depend exponentially on the dimension. Our coreset results solve open questions proposed by Schmidt et al. [WAOA, 2019] and Huang et al. [NeurIPS, 2019]. The new coreset construction helps to design several new approximation and streaming algorithms. In particular, we obtain the first true constant-approximation algorithm for metric fair clustering, whose running time is fixed-parameter tractable (FPT). In the Euclidean case, we derive the first $(1+ε)$-approximation algorithm for fair clustering whose time complexity is near-linear and does not depend exponentially on the dimension of the space. Besides, our coreset construction scheme is fairly general and gives rise to coresets for a wide range of constrained clustering problems. This leads to improved constant-approximations for these problems in general metrics and near-linear time $(1+ε)$-approximations in the Euclidean metric.

下载PDF全文

下载文献需遵守相关版权规定

论文标题