论文标题

wor and $ p $:$ \ ell_p $的草图 - 不替代的采样

WOR and $p$'s: Sketches for $\ell_p$-Sampling Without Replacement

论文作者

Cohen, Edith, Pagh, Rasmus, Woodruff, David P.

论文摘要

加权采样是数据分析和机器学习管道中的基本工具。样品用于有效估计统计信息或作为数据的稀疏表示。当重量分布偏差时,就像实际上一样,没有替换(WOR)采样的情况比使用更换(WR)采样更有效:它为相同数量的样品提供了更广泛的表示和更高的准确性。我们根据频率的功率$ p \在[0,2] $中的功率$ p \(或用于签名的数据,更新之和更新之和),设计了$ \ ell_p $采样的新颖合并草图,键的加权采样。我们的草图的大小仅随样本量线性增长。尽管进行了复杂的分析,但我们的设计还是简单而实用的,并且基于对广泛实现的重型击球手素描(例如Countsketch)的现成。我们的方法是第一个在$ p> 1 $的重要制度中提供更糟糕的采样的方法,也是第一个以$ p> 0 $处理签名更新的方法。

Weighted sampling is a fundamental tool in data analysis and machine learning pipelines. Samples are used for efficient estimation of statistics or as sparse representations of the data. When weight distributions are skewed, as is often the case in practice, without-replacement (WOR) sampling is much more effective than with-replacement (WR) sampling: it provides a broader representation and higher accuracy for the same number of samples. We design novel composable sketches for WOR $\ell_p$ sampling, weighted sampling of keys according to a power $p\in[0,2]$ of their frequency (or for signed data, sum of updates). Our sketches have size that grows only linearly with the sample size. Our design is simple and practical, despite intricate analysis, and based on off-the-shelf use of widely implemented heavy hitters sketches such as CountSketch. Our method is the first to provide WOR sampling in the important regime of $p>1$ and the first to handle signed updates for $p>0$.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源