论文标题

实施基于比较的外部排序

Implementing the Comparison-Based External Sort

论文作者

Polyntsov, Michael, Grigorev, Valentin, Smirnov, Kirill, Chernishev, George

论文摘要

在大数据时代,排序是DBMS和类似系统的必不可少的操作。进行数据排序可以帮助制作出运行时间明显较低的查询计划。它还可以提供其他好处,例如拥有将稳定产生数据(无爆发)的非阻滞操作员或具有减少内存足迹的操作员。 在查询处理的任何步骤中,都可能需要分类,即是源数据或中间结果。同时,要排序的数据可能不适合主内存。在这种情况下,应使用将中间结果写入磁盘的外部排序操作员。 在本文中,我们考虑了基于比较的排序类型的外部排序操作员。我们讨论其实施并描述相关的设计决策。我们的目的是研究合并步骤中使用的数据结构的性能的影响。为此,我们已经实验评估了DBM内实施的三个数据结构。 结果表明,即使是在通常是磁盘结合的现代商品计算机上,也值得努力实施有效的数据结构以进行合并。此外,我们证明,使用失败树是一种比天真的方法和基于堆的方法更有效的方法。

In the age of big data, sorting is an indispensable operation for DBMSes and similar systems. Having data sorted can help produce query plans with significantly lower run times. It also can provide other benefits like having non-blocking operators which will produce data steadily (without bursts), or operators with reduced memory footprint. Sorting may be required on any step of query processing, i.e., be it source data or intermediate results. At the same time, the data to be sorted may not fit into main memory. In this case, an external sort operator, which writes intermediate results to disk, should be used. In this paper we consider an external sort operator of the comparison-based sort type. We discuss its implementation and describe related design decisions. Our aim is to study the impact on performance of a data structure used on the merge step. For this, we have experimentally evaluated three data structures implemented inside a DBMS. Results have shown that it is worthwhile to make an effort to implement an efficient data structure for run merging, even on modern commodity computers which are usually disk-bound. Moreover, we demonstrated that using a loser tree is a more efficient approach than both the naive approach and the heap-based one.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源