Title
Rethinking Storage Management for Data Processing Pipelines in Cloud Data Centers
Authors
Abstract
Data processing frameworks such as Apache Beam and Apache Spark are used for a wide range of applications, from log analysis to data preparation for DNN training. It is thus unsurprising that there has been a large body of work on optimizing these frameworks, including their storage management. The shift to cloud computing requires optimizing across all pipelines running concurrently on a cluster. In this paper, we look at one specific instance of this problem: the placement of I/O-intensive temporary intermediate data on SSDs and HDDs. Efficient data placement is challenging because I/O density is usually unknown at the time the data needs to be placed. Additionally, external factors such as load variability, job preemption, and job priorities can impact job completion times, which in turn affect the I/O density of the temporary files in the workload. In this paper, we envision that machine learning can be used to solve this problem. We analyze production logs from Google's data centers for a range of data processing pipelines. Our analysis shows that I/O density may be predictable. This suggests that carefully crafted learning-based strategies could extract features predictive of the I/O density of temporary files involved in various transformations, and that these features could be used to improve the efficiency of storage management in data processing pipelines.
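
To make the placement decision the abstract describes more concrete, the following is a minimal sketch (not the paper's actual method) of how a learned I/O-density predictor could drive an SSD/HDD routing policy. The feature set, the definition of I/O density as total I/O bytes per stored byte, the training data, and the placement threshold are all illustrative assumptions.

```python
# Sketch: predict a temporary file's I/O density at creation time and route
# I/O-intensive files to SSD, the rest to HDD. All names and values below are
# hypothetical; the paper only suggests that such prediction may be feasible.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical features known when the file is created: an ID for the
# transformation producing it, the job's priority, and recent cluster load.
X_train = np.array([
    [3, 1, 0.7],
    [1, 0, 0.4],
    [2, 1, 0.9],
    [3, 0, 0.2],
])
# Observed I/O densities for past temporary files, taken here as total bytes
# read and written over the file's lifetime divided by its size (an assumption).
y_train = np.array([12.0, 1.5, 9.0, 0.8])

model = GradientBoostingRegressor().fit(X_train, y_train)

def place_temp_file(features, ssd_threshold=5.0):
    """Return the storage tier for a new temporary file based on its
    predicted I/O density; the threshold is an illustrative cutoff."""
    predicted_density = model.predict([features])[0]
    return "SSD" if predicted_density >= ssd_threshold else "HDD"

print(place_temp_file([2, 1, 0.8]))  # hot transformation, likely routed to SSD
```

In a real system the predictor would be retrained as workloads drift, and the threshold would reflect the cluster's relative SSD and HDD capacity and cost rather than a fixed constant.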