论文标题
流式传输算法,以最大程度地限制多样性
Streaming Algorithms for Diversity Maximization with Fairness Constraints
论文作者
论文摘要
多样性最大化是数据汇总,Web搜索和推荐系统中广泛应用的基本问题。给定$ n $元素的$ x $元素,它要求选择一个$ k \ ll n $ ements的子集$ s $,具有最大\ emph {多样性},这是由$ s $中的元素之间的差异量化的。在本文中,我们关注流媒体设置中公平限制的多样性最大化问题。具体而言,我们考虑了最大值的多样性目标,该目标选择了一个子集$ s $,该子集$ s $最大化了其中任何一对不同元素之间的最小距离(不同)。假设集合$ x $通过某些敏感属性(例如性别或种族)将$ m $脱节组分为$ m $脱节组,以确保\ emph {fairness}要求所选的子集$ s $包含[1,m] $的每个组$ i \ in Group $ i \的$ k_i $元素。流算法应在一个通过中顺序处理$ x $,并返回具有最大\ emph {多样性}的子集,同时保证公平约束。尽管对多样性的最大化已经进行了广泛的研究,但唯一可以符合最大速度多样性目标和公平性约束的已知算法对于数据流效率低下。由于多样性最大化一般是NP的最大化,因此我们提出了两个在数据流中最大化的公平多样性最大化的近似算法,其中第一个是$ \ frac {1- \ varepsilon} {4} {4} $ - 近似值 - $ m = 2 $,其中$ \ \ varepsilon \ in(0.0,1) $ \ frac {1- \ varepsilon} {3M+2} $ - 任意$ m $的近似值。现实世界和合成数据集的实验结果表明,这两种算法都提供了与最先进算法相当的质量解决方案,而在流设置中运行几个数量级。
Diversity maximization is a fundamental problem with wide applications in data summarization, web search, and recommender systems. Given a set $X$ of $n$ elements, it asks to select a subset $S$ of $k \ll n$ elements with maximum \emph{diversity}, as quantified by the dissimilarities among the elements in $S$. In this paper, we focus on the diversity maximization problem with fairness constraints in the streaming setting. Specifically, we consider the max-min diversity objective, which selects a subset $S$ that maximizes the minimum distance (dissimilarity) between any pair of distinct elements within it. Assuming that the set $X$ is partitioned into $m$ disjoint groups by some sensitive attribute, e.g., sex or race, ensuring \emph{fairness} requires that the selected subset $S$ contains $k_i$ elements from each group $i \in [1,m]$. A streaming algorithm should process $X$ sequentially in one pass and return a subset with maximum \emph{diversity} while guaranteeing the fairness constraint. Although diversity maximization has been extensively studied, the only known algorithms that can work with the max-min diversity objective and fairness constraints are very inefficient for data streams. Since diversity maximization is NP-hard in general, we propose two approximation algorithms for fair diversity maximization in data streams, the first of which is $\frac{1-\varepsilon}{4}$-approximate and specific for $m=2$, where $\varepsilon \in (0,1)$, and the second of which achieves a $\frac{1-\varepsilon}{3m+2}$-approximation for an arbitrary $m$. Experimental results on real-world and synthetic datasets show that both algorithms provide solutions of comparable quality to the state-of-the-art algorithms while running several orders of magnitude faster in the streaming setting.