Paper Title

Utilizing Ensemble Learning for Performance and Power Modeling and Improvement of Parallel Cancer Deep Learning CANDLE Benchmarks

Paper Authors

Xingfu Wu, Valerie Taylor

Paper Abstract

Machine learning (ML) continues to grow in importance across nearly all domains and is a natural tool for modeling and learning from data. Often a tradeoff exists between a model's ability to minimize bias and variance. In this paper, we utilize ensemble learning to combine linear, nonlinear, and tree-/rule-based ML methods to cope with the bias-variance tradeoff, resulting in more accurate models. Hardware performance counter values are correlated with properties of applications that impact performance and power on the underlying system. We use the datasets collected for two parallel cancer deep learning CANDLE benchmarks, NT3 (weak scaling) and P1B2 (strong scaling), to build performance and power models based on hardware performance counters using single-object and multiple-objects ensemble learning, and to identify the most important counters for improvement. Based on the insights from these models, we improve the performance and energy of P1B2 and NT3 by optimizing the deep learning environments TensorFlow, Keras, Horovod, and Python under a huge page size of 8 MB on the Cray XC40 Theta at Argonne National Laboratory. Experimental results show that ensemble learning not only produces more accurate models but also provides more robust performance counter rankings. We achieve up to 61.15% performance improvement and up to 62.58% energy saving for P1B2, and up to 55.81% performance improvement and up to 52.60% energy saving for NT3, on up to 24,576 cores.
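To make the modeling approach concrete, below is a minimal sketch (not the authors' actual pipeline) of how linear, nonlinear, and tree-based regressors can be stacked into one ensemble that predicts performance from hardware counter values and then ranks the counters by importance. The scikit-learn StackingRegressor, the PAPI counter names, and the synthetic dataset are all illustrative assumptions, not the paper's data or method.

```python
# Illustrative sketch only: stacking linear, nonlinear (kernel), and
# tree-based regressors, as the abstract describes in general terms.
# Counter names and synthetic data below are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical dataset: rows = application runs, columns = hardware
# performance counters; target = measured runtime (or power).
counters = ["PAPI_TOT_INS", "PAPI_L2_TCM", "PAPI_RES_STL", "PAPI_BR_MSP"]
X = rng.random((200, len(counters)))
y = 2.0 * X[:, 0] + np.sin(3 * X[:, 1]) + 0.5 * X[:, 2] ** 2 \
    + rng.normal(0, 0.05, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stack a linear, a nonlinear (kernel), and a tree-based base learner;
# a ridge meta-learner blends their predictions to balance bias and variance.
ensemble = StackingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("svr", SVR(kernel="rbf")),
        ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    final_estimator=Ridge(),
)
ensemble.fit(X_train, y_train)
print(f"R^2 on held-out runs: {ensemble.score(X_test, y_test):.3f}")

# Rank counters by permutation importance on the fitted ensemble,
# mirroring the paper's goal of identifying the most important counters.
imp = permutation_importance(ensemble, X_test, y_test,
                             n_repeats=10, random_state=0)
for name, score in sorted(zip(counters, imp.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

Because the importance ranking is computed on the blended ensemble rather than on any single base learner, it tends to be less sensitive to one model's bias or variance, which is the intuition behind the paper's claim of more robust counter rankings.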
