Predict; Do not React for Enabling Efficient Fine Grain DVFS in GPUs

Paper title

Predict; Do not React for Enabling Efficient Fine Grain DVFS in GPUs

Authors

Bharadwaj, Srikant; Das, Shomit; Mazumdar, Kaushik; Beckmann, Bradford; Kosonocky, Stephen

Abstract

With the continuous improvement of on-chip integrated voltage regulators (IVRs) and fast, adaptive frequency control, dynamic voltage-frequency scaling (DVFS) transition times have shrunk from the microsecond to the nanosecond regime, providing additional opportunities to improve energy efficiency. The key to unlocking the benefits of continued improvement in voltage-frequency circuit technology is the creation of new, smarter DVFS mechanisms that better adapt to rapid fluctuations in workload demand. It is particularly important to optimize fine-grain DVFS mechanisms for graphics processing units (GPUs) as these chips become ever more important workhorses in the datacenter. However, the massive amount of thread-level parallelism in GPUs makes it uniquely difficult to determine the optimal voltage-frequency state at run time. Existing solutions, mostly designed for single-threaded CPUs and longer time scales, fail to consider the seemingly chaotic, highly varying nature of GPU workloads at short time scales. This paper proposes a novel prediction mechanism, PCSTALL, that is tailored for emerging DVFS capabilities in GPUs and achieves near-optimal energy efficiency. Using the insights from our fine-grained workload analysis, we propose a wavefront-level, program counter (PC)-based DVFS mechanism that improves program behavior prediction accuracy by 32% on average for a wide set of GPU applications at 1-microsecond DVFS time epochs. Compared to the current state of the art, our PC-based technique achieves a 19% average improvement when optimized for Energy-Delay-Squared Product at 50-microsecond time epochs, reaching a 32% power-efficiency improvement when operated with 1-microsecond DVFS technologies.
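The abstract describes PC-based prediction only at a high level. The Python sketch below illustrates the general idea: a table indexed by program counter remembers how stall-heavy each code region was, the PCs of the active wavefronts at the start of an epoch are used to predict the next epoch's behavior, and the voltage-frequency state with the lowest predicted Energy-Delay-Squared Product (ED²P = E × D²) is chosen. This is not the authors' PCSTALL implementation. The predicted quantity (a memory-stall fraction), the exponential averaging, the class and function names, the V/f operating points, and the toy timing and power models are all hypothetical assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical voltage-frequency operating points as (GHz, volts); not from the paper.
VF_STATES = [(1.0, 0.70), (1.4, 0.85), (1.8, 1.00)]


@dataclass
class PcStallPredictor:
    """Hypothetical PC-indexed table: PC -> exponentially averaged stall fraction."""
    alpha: float = 0.5          # weight given to the newest observation
    default_stall: float = 0.5  # prediction used for PCs never seen before
    table: dict = field(default_factory=dict)

    def update(self, pc: int, observed_stall: float) -> None:
        # Blend the stall fraction observed in the last epoch into this PC's estimate.
        old = self.table.get(pc, observed_stall)
        self.table[pc] = self.alpha * observed_stall + (1 - self.alpha) * old

    def predict(self, wavefront_pcs: list) -> float:
        # Average the per-PC estimates over the wavefronts active at epoch start.
        if not wavefront_pcs:
            return self.default_stall
        estimates = [self.table.get(pc, self.default_stall) for pc in wavefront_pcs]
        return sum(estimates) / len(estimates)


def ed2p(freq, volt, stall, f_ref=1.8, t_ref=1.0, k_dyn=1.0, k_static=0.3):
    """Toy ED^2P model: only the non-stalled fraction of time scales with 1/f."""
    t = t_ref * ((1 - stall) * f_ref / freq + stall)
    power = k_dyn * volt ** 2 * freq + k_static * volt  # dynamic V^2*f plus static term
    energy = power * t
    return energy * t ** 2


def choose_state(stall):
    # Pick the V/f state with the lowest predicted ED^2P for the next epoch.
    return min(VF_STATES, key=lambda vf: ed2p(vf[0], vf[1], stall))


# Example epoch: train on two PCs, then predict for wavefronts about to run.
predictor = PcStallPredictor()
predictor.update(0x100, 0.9)  # memory-bound region
predictor.update(0x200, 0.0)  # compute-bound region
print(choose_state(predictor.predict([0x100, 0x100])))  # -> (1.0, 0.7)
print(choose_state(predictor.predict([0x200])))         # -> (1.8, 1.0)
```

The design point the sketch tries to capture is the contrast in the title: the next epoch's state is chosen from where the wavefronts are in the program (their PCs), rather than by reacting to the aggregate behavior measured in the previous epoch.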
