论文标题

用于支持感知直方图的流算法

Streaming Algorithms for Support-Aware Histograms

论文作者

Chen, Justin Y., Indyk, Piotr, Wagner, Tal

论文摘要

直方图,即零件恒定近似值,是用于表示数据分布的流行工具。传统上,使用$ L_P $ NORM测量直方图和潜在分布之间的差异(即近似误差),这总和了两个函数与域中所有项目的差异。尽管在许多应用程序中有用,但此误差度量的缺点是,它以相同方式处理所有项目的近似错误,而不论项目的质量对于使用近似值的下游应用程序是否重要。结果,即使相对简单的分布也无法通过简洁的直方图近似,而不会产生大误差。 在本文中,我们通过调整近似的定义来解决此问题,以便仅考虑属于分布支持的项目的错误。在此定义下,我们开发了有效的1次通行和2通流算法,这些算法在亚线性空间中计算近乎最佳的直方图。我们还介绍了此问题的空间复杂性的下限。令人惊讶的是,在这个误差概念下,1频道和2频道流算法的空间复杂性存在指数差距。最后,我们在真实和合成数据集的集合中演示了算法的实用性。

Histograms, i.e., piece-wise constant approximations, are a popular tool used to represent data distributions. Traditionally, the difference between the histogram and the underlying distribution (i.e., the approximation error) is measured using the $L_p$ norm, which sums the differences between the two functions over all items in the domain. Although useful in many applications, the drawback of this error measure is that it treats approximation errors of all items in the same way, irrespective of whether the mass of an item is important for the downstream application that uses the approximation. As a result, even relatively simple distributions cannot be approximated by succinct histograms without incurring large error. In this paper, we address this issue by adapting the definition of approximation so that only the errors of the items that belong to the support of the distribution are considered. Under this definition, we develop efficient 1-pass and 2-pass streaming algorithms that compute near-optimal histograms in sub-linear space. We also present lower bounds on the space complexity of this problem. Surprisingly, under this notion of error, there is an exponential gap in the space complexity of 1-pass and 2-pass streaming algorithms. Finally, we demonstrate the utility of our algorithms on a collection of real and synthetic data sets.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源