Paper Title

Towards Fast Theta-join: A Prefiltering and Amalgamated Partitioning Approach

Authors

Jiashu Wu, Yang Wang, Xiaopeng Fan, Kejiang Ye, Chengzhong Xu

Abstract

As one of the most useful online processing techniques, the theta-join operation has been utilized by many applications to fully exploit the relationships between data streams in various scenarios. Accordingly, sustained research effort has gone into optimizing its performance in distributed environments, typically by reducing the number of Cartesian products as much as possible. In this article, we design and implement a novel fast theta-join algorithm, called Prefap, by developing two distinct techniques, prefiltering and amalgamated partitioning, on top of the state-of-the-art FastThetaJoin algorithm to optimize the efficiency of the theta-join operation. First, we develop a prefiltering strategy that is applied before the data streams are partitioned, which reduces the amount of data involved and enables finer-grained partitioning. Second, to avoid partitioning the data streams in a coarse-grained, isolated manner and to improve the quality of partition-level filtering, we introduce an amalgamated partitioning mechanism that merges the partitioning boundaries of the two data streams to support fine-grained partitioning. By integrating these two techniques into the existing FastThetaJoin algorithm, we design and implement a new framework that achieves fewer Cartesian products and higher theta-join efficiency. We evaluate Prefap against existing algorithms, FastThetaJoin in particular, on both synthetic and real data streams, from two-way to multiway theta-join, to demonstrate its superiority.
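
The abstract gives no implementation details, so the following is only an illustrative sketch of the two ideas it names, not Prefap itself. It assumes a single numeric join attribute, the inequality predicate r < s, in-memory lists instead of distributed streams, and simple quantile split points. All function names (prefilter, boundaries, partition, theta_join_less_than) are hypothetical.

```python
from bisect import bisect_right
from itertools import product


def prefilter(r_vals, s_vals):
    """Drop values that cannot satisfy r < s with any partner (assumed rule)."""
    if not r_vals or not s_vals:
        return [], []
    r_min, s_max = min(r_vals), max(s_vals)
    return [r for r in r_vals if r < s_max], [s for s in s_vals if s > r_min]


def boundaries(vals, k):
    """Pick k-1 split points from one stream using simple quantiles."""
    ordered = sorted(vals)
    return [ordered[len(ordered) * i // k] for i in range(1, k)]


def partition(vals, bounds):
    """Place each value in the bucket delimited by the shared boundaries."""
    buckets = [[] for _ in range(len(bounds) + 1)]
    for v in vals:
        buckets[bisect_right(bounds, v)].append(v)
    return buckets


def theta_join_less_than(r_vals, s_vals, k=4):
    """Return (matching pairs, number of element-level comparisons made)."""
    r_vals, s_vals = prefilter(r_vals, s_vals)
    if not r_vals or not s_vals:
        return [], 0
    # Amalgamate: both streams are split on the union of their split points.
    bounds = sorted(set(boundaries(r_vals, k)) | set(boundaries(s_vals, k)))
    r_parts, s_parts = partition(r_vals, bounds), partition(s_vals, bounds)
    out, compared = [], 0
    for i, rp in enumerate(r_parts):
        for j, sp in enumerate(s_parts):
            if i < j:
                # Every r in bucket i is below every s in bucket j: emit all.
                out.extend(product(rp, sp))
            elif i == j:
                # Only same-bucket pairs need a real Cartesian-product check.
                compared += len(rp) * len(sp)
                out.extend((r, s) for r in rp for s in sp if r < s)
            # i > j: every r exceeds every s, so the pair of buckets is skipped.
    return out, compared
```

Because both inputs share one boundary list, a pair of buckets is either fully matching, fully non-matching, or on the diagonal. Under these assumptions, only the diagonal pairs require element-by-element comparison, which is the sense in which partition-level filtering reduces Cartesian products.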
