Paper Title

Metadata Caching in Presto: Towards Fast Data Processing

Authors

Beinan Wang, Chunxu Tang, Rongrong Zhong, Bin Fan, Yi Wang, Jasmine Wang, Shouwei Chen, Bowen Ding, Lu Zhang

Abstract

Presto is an open-source distributed SQL query engine for OLAP, aiming for "SQL on everything". Since being open-sourced in 2013, Presto has steadily gained popularity in large-scale data analytics and has been adopted by a wide range of enterprises. From the development and operation of Presto, we observed a significant amount of CPU consumption spent on parsing column-oriented data files in Presto worker nodes. This blocks some companies, including Meta, from increasing their analytical data volumes. In this paper, we present a metadata caching layer, built on top of the Alluxio SDK cache and incorporated in each Presto worker node, to cache the intermediate results of file parsing. The metadata cache provides two caching methods: caching the decompressed metadata bytes from raw data files, and caching the deserialized metadata objects. Our evaluation of the TPC-DS benchmark on Presto demonstrates that, when the cache is warm, the first method can reduce a query's CPU consumption by 10%-20%, whereas the second method can reduce CPU usage by 20%-40%.
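
The difference between the two caching methods in the abstract can be illustrated with a minimal sketch. This is not Presto or Alluxio code: the class `MetadataReader`, the helper `run`, the cache key, and the use of `zlib` and `json` as stand-ins for a columnar file format's metadata compression and deserialization are all assumptions made for illustration. The sketch only counts how often each parsing step runs under each mode.

```python
import json
import zlib


class MetadataReader:
    """Toy reader (hypothetical). zlib/json stand in for a real footer codec."""

    def __init__(self, mode):
        if mode not in ("none", "bytes", "objects"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.bytes_cache = {}    # key -> decompressed metadata bytes
        self.object_cache = {}   # key -> deserialized metadata object
        self.decompress_calls = 0
        self.deserialize_calls = 0

    def _decompress(self, raw):
        self.decompress_calls += 1
        return zlib.decompress(raw)

    def _deserialize(self, data):
        self.deserialize_calls += 1
        return json.loads(data)

    def read_metadata(self, key, raw):
        if self.mode == "objects":
            # Method 2: cache the deserialized object; a hit skips both steps.
            if key not in self.object_cache:
                self.object_cache[key] = self._deserialize(self._decompress(raw))
            return self.object_cache[key]
        if self.mode == "bytes":
            # Method 1: cache decompressed bytes; a hit still deserializes.
            if key not in self.bytes_cache:
                self.bytes_cache[key] = self._decompress(raw)
            return self._deserialize(self.bytes_cache[key])
        # No caching: decompress and deserialize on every read.
        return self._deserialize(self._decompress(raw))


# Made-up metadata payload, compressed the way a file footer might be.
raw_footer = zlib.compress(
    json.dumps({"columns": ["a", "b"], "rows": 100}).encode()
)


def run(mode, reads=3):
    """Read the same metadata `reads` times; return call counts and the result."""
    reader = MetadataReader(mode)
    meta = None
    for _ in range(reads):
        meta = reader.read_metadata("file-1:footer", raw_footer)
    return reader.decompress_calls, reader.deserialize_calls, meta
```

In this toy setting, caching bytes removes repeated decompression only, while caching objects removes repeated deserialization as well. That matches the direction of the reported results, in which the second method saves more CPU. The sketch says nothing about the actual magnitudes or about the memory cost of keeping deserialized objects resident.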
