BayesCard: Revitalizing Bayesian Frameworks for Cardinality Estimation

Paper Title

BayesCard: Revitalizing Bayesian Frameworks for Cardinality Estimation

Authors

Ziniu Wu, Amir Shaikhha, Rong Zhu, Kai Zeng, Yuxing Han, Jingren Zhou

Abstract

Cardinality estimation (CardEst) is an essential component of query optimizers and a fundamental problem in DBMSs. A desirable CardEst method should attain good algorithmic performance, remain stable across varied data settings, and be friendly to system deployment. However, no existing CardEst method fulfills all three criteria at the same time. Traditional methods often have significant algorithmic drawbacks, such as large estimation errors. Recently proposed deep-learning-based methods largely improve estimation accuracy, but their performance can be greatly affected by the data, and they are often difficult to deploy in a system. In this paper, we revitalize Bayesian networks (BNs) for CardEst by incorporating techniques from probabilistic programming languages. We present BayesCard, the first framework that inherits the advantages of BNs, i.e., high estimation accuracy and interpretability, while overcoming their drawbacks, i.e., low structure learning and inference efficiency. This makes BayesCard a perfect candidate for commercial DBMS deployment. Our experimental results on several single-table and multi-table benchmarks indicate BayesCard's superiority over existing state-of-the-art CardEst methods: BayesCard achieves comparable or better accuracy, 1-2 orders of magnitude faster inference time, 1-3 orders of magnitude faster training time, 1-3 orders of magnitude smaller model size, and 1-2 orders of magnitude faster updates. Meanwhile, BayesCard maintains stable performance when the data varies across different settings. We also deploy BayesCard in PostgreSQL. On the IMDB benchmark workload, it improves the end-to-end query time by 13.3%, which is very close to the optimal result of 14.2% obtained using an oracle of true cardinalities.
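
To make the general idea of BN-based cardinality estimation concrete, here is a minimal sketch. It is not BayesCard's implementation: the fixed chain structure A -> B -> C, the toy table, and the helper names `fit_chain` and `estimate_cardinality` are all assumptions made for illustration. The sketch learns the conditional distributions by counting, then estimates the number of rows matching `A = a AND C = c` as the row count times the probability the network assigns to that predicate.

```python
from collections import Counter

def fit_chain(rows):
    """Learn P(A), P(B|A), P(C|B) by counting, for an assumed chain A -> B -> C."""
    n = len(rows)
    count_a = Counter(a for a, _, _ in rows)
    count_b = Counter(b for _, b, _ in rows)
    count_ab = Counter((a, b) for a, b, _ in rows)
    count_bc = Counter((b, c) for _, b, c in rows)
    p_a = {a: k / n for a, k in count_a.items()}
    p_b_given_a = {(a, b): k / count_a[a] for (a, b), k in count_ab.items()}
    p_c_given_b = {(b, c): k / count_b[b] for (b, c), k in count_bc.items()}
    return n, p_a, p_b_given_a, p_c_given_b

def estimate_cardinality(model, a, c):
    """Estimate |{rows : A = a and C = c}| by marginalizing out B."""
    n, p_a, p_b_given_a, p_c_given_b = model
    b_values = {b for (_, b) in p_b_given_a}
    selectivity = p_a.get(a, 0.0) * sum(
        p_b_given_a.get((a, b), 0.0) * p_c_given_b.get((b, c), 0.0)
        for b in b_values
    )
    return n * selectivity

# Hypothetical toy table with columns (A, B, C).
rows = [
    ("x", "p", 1), ("x", "p", 1), ("x", "q", 2), ("y", "q", 2),
    ("y", "q", 1), ("y", "p", 1), ("x", "p", 2), ("y", "q", 2),
]
model = fit_chain(rows)
estimate = estimate_cardinality(model, "x", 1)
true_count = sum(1 for a, _, c in rows if a == "x" and c == 1)
print(f"estimated: {estimate}, true: {true_count}")
```

On this toy table the estimate is 2.5 rows against a true count of 2. The gap comes from the chain's assumption that C is independent of A given B, which is the kind of approximation a learned network structure is meant to keep small.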
