Paper Title

Characterizing BigBench queries, Hive, and Spark in multi-cloud environments

Authors

Nicolas Poggi, Alejandro Montero, David Carrera

Abstract

BigBench is the new standard (TPCx-BB) for benchmarking and testing Big Data systems. The TPCx-BB specification describes several business use cases (queries) that require a broad combination of data extraction techniques, including SQL, Map/Reduce (M/R), user code (UDF), and Machine Learning. However, there is currently no widespread knowledge of the different resource requirements and expected performance of each query, as there is for more established benchmarks. At the same time, cloud providers offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Hive and Spark come ready to use, with a general-purpose configuration and upgrade management. This study characterizes both the BigBench queries and the out-of-the-box performance of Spark and Hive versions in the cloud. It also compares popular PaaS offerings from Azure HDInsight, Amazon Web Services EMR, and Google Cloud Dataproc in terms of reliability, data scalability (1 GB to 10 TB), versions, and settings. The query characterization highlights the similarities and differences between the Hive and Spark frameworks, and identifies which queries consume the most resources in terms of CPU, memory, and I/O. Scalability results show that configuration tuning is needed in most cloud providers as the data scale grows, especially with respect to Spark's memory usage. These results can help practitioners quickly test systems by picking a subset of the queries that stresses each of the categories. The results also show how Hive and Spark compare and what performance can be expected of each in PaaS.
