Paper Title

An Empirical-cum-Statistical Approach to Power-Performance Characterization of Concurrent GPU Kernels

Authors

Goswami, Nilanjan; Qouneh, Amer; Li, Chao; Li, Tao

Abstract

The growing deployment of power- and energy-efficient throughput accelerators (GPUs) in data centers demands better power-performance co-optimization capabilities in GPUs. Realizing exascale computing with accelerators requires further improvements in power efficiency. With hardwired support for kernel concurrency in accelerators, simultaneous inter- and intra-workload kernel computation promises increased throughput at a lower energy budget. To improve the performance-per-watt metric of these architectures, a systematic empirical study of real-world throughput workloads (with concurrent kernel execution) is required. To this end, we propose a multi-kernel throughput workload generation framework that will facilitate aggressive energy and performance management of exascale data centers and will stimulate synergistic power-performance co-optimization of throughput architectures. We also demonstrate a multi-kernel throughput benchmark suite, based on the framework, that encapsulates symmetric, asymmetric and co-existing (often appearing together) kernel-based workloads. On average, our analysis reveals that spatial and temporal concurrency within kernel execution in throughput architectures reduces energy consumption by 32%, 26% and 33% on the GTX470, Tesla M2050 and Tesla K20, respectively, across 12 benchmarks. Concurrency and enhanced utilization are often correlated but do not imply significant deviation in power dissipation. Diversity analysis of the proposed multi-kernels confirms characteristic variation and power-profile diversity within the suite. In addition, we explain several findings regarding power-performance co-optimization of concurrent throughput workloads.
