Bloom过滤器的流中概率的基数估计

论文标题

Bloom过滤器的流中概率的基数估计

In-stream Probabilistic Cardinality Estimation for Bloom Filters

论文作者

Scholler, Remy, Couchot, Jean-Francois, Alaoui-Ismaili, Oumaima, Renaud, Denis, Ballot, Eric

论文摘要

来自物联网传播者，社交网络，蜂窝网络等不同来源的数据量在过去几年中成倍增加。概率数据结构（PDS）是适用于大型数据处理和流媒体应用程序的确定性数据结构的有效替代方案。它们主要用于近似会员查询，频率计数，基数估计和相似性研究。在大型数据集或流数据中找到不同元素的数量是一个活跃的研究领域。在这项工作中，我们表明，基于BLOOM过滤器的常规方法平均而言相对准确，但差异很大。因此，降低这种差异很有趣，可以获得准确的统计数据。我们提出了一种概率方法，以更准确地估算Bloom过滤器的基于其参数的基数，即哈希函数$ k $，size $ m $和一个计数器$ s $，每当元素不在滤镜中时（即，当会员质量在此元素上的结果时）都会增加。由于Bloom过滤器的性质，计数器的价值永远不会大于确切的基数，但是哈希碰撞会导致其低估它。这会产生一个计数误差，我们可以准确地估算其在流中，并及其标准偏差。我们还讨论了一种基于其计数错误优化Bloom过滤器的参数的方法。我们通过通过对移动网络运营商提供的真实移动性数据集创建的合成数据来评估我们的方法，该分析以根据手机记录计算的位移矩阵的形式。此处提出的方法至少表现出色，并且差异的差异（约为6至7倍）比最新方法的状态较低。

The amount of data coming from different sources such as IoT-sensors, social networks, cellular networks, has increased exponentially during the last few years. Probabilistic Data Structures (PDS) are efficient alternatives to deterministic data structures suitable for large data processing and streaming applications. They are mainly used for approximate membership queries, frequency count, cardinality estimation and similarity research. Finding the number of distinct elements in a large dataset or in streaming data is an active research area. In this work, we show that usual methods based on Bloom filters for this kind of cardinality estimation are relatively accurate on average but have a high variance. Therefore, reducing this variance is interesting to obtain accurate statistics. We propose a probabilistic approach to estimate more accurately the cardinality of a Bloom filter based on its parameters, i.e., number of hash functions $k$, size $m$, and a counter $s$ which is incremented whenever an element is not in the filter (i.e., when the result of the membership query for this element is negative). The value of the counter can never be larger than the exact cardinality due to the Bloom filter's nature, but hash collisions can cause it to underestimate it. This creates a counting error that we estimate accurately, in-stream, along with its standard deviation. We also discuss a way to optimize the parameters of a Bloom filter based on its counting error. We evaluate our approach with synthetic data created from an analysis of a real mobility dataset provided by a mobile network operator in the form of displacement matrices computed from mobile phone records. The approach proposed here performs at least as well on average and has a much lower variance (about 6 to 7 times less) than state of the art methods.

下载PDF全文

下载文献需遵守相关版权规定

论文标题