论文标题
加权的加权随机抽样
Weighted Random Sampling over Joins
论文作者
论文摘要
将记录与所有符合链接条件的记录一起加入记录可能会导致天文数量大量的组合,这是由于许多一对一的关系。对于这种具有挑战性的(无环)联接,对加入结果的随机样本是使用超大连接结果的实际替代方法。尽管先前的工作仅限于统一的联接采样,其中每个联接行都具有相同的概率,但该工作中的范围扩大到加权采样以支持新兴应用程序,例如观察数据中的科学发现和隐私保护查询的答案。尽管有一些幼稚的方法,但这项工作介绍了从结合结果加权随机抽样的第一种方法。由于缺乏基准,对各种联接类型和现实世界数据集进行了实验,以在同样概率的设置中使用基于主要内存指数的方法来显示大量的内存和竞争性能。与现有的统一抽样方法相反,这些方法需要占用有争议的资源的准备好的结构,以挤出稍微更快的查询时间,拟议的方法表现出在实践中急需的特质,即记忆足迹,流媒体操作,支持选择,选择,外部连接,半联接,半联盟和反式汇合和不平等的预处理。所有相关的代码和数据均可在以下网址找到:https://github.com/shekelyan/weightedjoinsampling
Joining records with all other records that meet a linkage condition can result in an astronomically large number of combinations due to many-to-many relationships. For such challenging (acyclic) joins, a random sample over the join result is a practical alternative to working with the oversized join result. Whereas prior works are limited to uniform join sampling where each join row is assigned the same probability, the scope is extended in this work to weighted sampling to support emerging applications such as scientific discovery in observational data and privacy-preserving query answering. Notwithstanding some naive methods, this work presents the first approach for weighted random sampling from join results. Due to a lack of baselines, experiments over various join types and real-world data sets are conducted to show substantial memory savings and competitive performance with main-memory index-based approaches in the equal-probability setting. In contrast to existing uniform sampling approaches that require prepared structures that occupy contested resources to squeeze out slightly faster query-times, the proposed approaches exhibit qualities that are urgently needed in practice, namely reduced memory footprint, streaming operation, support for selections, outer joins, semi joins and anti joins and unequal-probability sampling. All pertinent code and data can be found at: https://github.com/shekelyan/weightedjoinsampling