使用牙龈分布的基数估算

论文标题

使用牙龈分布的基数估算

Cardinality estimation using Gumbel distribution

论文作者

Łukasiewicz, Aleksander, Uznański, Przemysław

论文摘要

基数估计是近似具有重复元素的大数据集中不同元素数量的任务。 LogLog和Hyperloglog（C.F. Durand和Flajolet [ESA 2003]，Flajolet等人[离散数学理论2007]）是针对基数估算的小空间素描方案，这些方案具有强大的理论保障性能，并且在实践中非常有效。这使它们成为一个非常流行的解决方案，该解决方案在大数据系统中的许多实现（例如代数，apache datasketches，bigquery，presto和redis）。然而，尽管具有简单而优雅的配方，但对日志和超置池的分析都非常参与 - 跨越了数十页的分析组合和复杂功能分析。我们提出对Loglog和Hyperloglog的修改，该修改用连续的牙龈分布代替离散的几何分布。这导致了对估计保证的非常简短，简单和基本的分析，以及估计器的行为更顺畅。

Cardinality estimation is the task of approximating the number of distinct elements in a large dataset with possibly repeating elements. LogLog and HyperLogLog (c.f. Durand and Flajolet [ESA 2003], Flajolet et al. [Discrete Math Theor. 2007]) are small space sketching schemes for cardinality estimation, which have both strong theoretical guarantees of performance and are highly effective in practice. This makes them a highly popular solution with many implementations in big-data systems (e.g. Algebird, Apache DataSketches, BigQuery, Presto and Redis). However, despite having simple and elegant formulation, both the analysis of LogLog and HyperLogLog are extremely involved -- spanning over tens of pages of analytic combinatorics and complex function analysis. We propose a modification to both LogLog and HyperLogLog that replaces discrete geometric distribution with a continuous Gumbel distribution. This leads to a very short, simple and elementary analysis of estimation guarantees, and smoother behavior of the estimator.

下载PDF全文

下载文献需遵守相关版权规定

论文标题