Improving Scalability with GPU-Aware Asynchronous Tasks

Paper Title

Improving Scalability with GPU-Aware Asynchronous Tasks

Authors

Jaemin Choi, David F. Richards, Laxmikant V. Kale

Abstract

Asynchronous tasks, when created with over-decomposition, enable automatic computation-communication overlap which can substantially improve performance and scalability. This is not only applicable to traditional CPU-based systems, but also to modern GPU-accelerated platforms. While the ability to hide communication behind computation can be highly effective in weak scaling scenarios, performance begins to suffer with smaller problem sizes or in strong scaling due to fine-grained overheads and reduced room for overlap. In this work, we integrate GPU-aware communication into asynchronous tasks in addition to computation-communication overlap, with the goal of reducing time spent in communication and further increasing GPU utilization. We demonstrate the performance impact of our approach using a proxy application that performs the Jacobi iterative method, Jacobi3D. In addition to optimizations to minimize synchronizations between the host and GPU devices and increase the concurrency of GPU operations, we explore techniques such as kernel fusion and CUDA Graphs to mitigate fine-grained overheads at scale.
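
As a rough illustration of the Jacobi iteration and the over-decomposition idea mentioned in the abstract, the sketch below performs one 7-point Jacobi sweep on a small 3D grid, then repeats it with the grid split along x into blocks that each carry one halo plane per side. It is not the authors' Jacobi3D code: it is pure Python, runs sequentially on the CPU, and does not model asynchrony, GPU-aware communication, kernel fusion, or CUDA Graphs. All function names (`make_grid`, `jacobi_sweep`, `split_with_halos`, `blocked_sweep`) are hypothetical, and copying neighbor planes into each block merely stands in for a halo exchange.

```python
def make_grid(nx, ny, nz):
    """Hypothetical test grid: the x = 0 boundary plane is hot (1.0), the rest is 0.0."""
    return [[[1.0 if i == 0 else 0.0 for _ in range(nz)]
             for _ in range(ny)] for i in range(nx)]


def jacobi_sweep(grid):
    """One Jacobi sweep: every interior cell becomes the mean of its six neighbors."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    # Start from a copy so boundary cells keep their old values.
    new = [[row[:] for row in plane] for plane in grid]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                new[i][j][k] = (grid[i - 1][j][k] + grid[i + 1][j][k] +
                                grid[i][j - 1][k] + grid[i][j + 1][k] +
                                grid[i][j][k - 1] + grid[i][j][k + 1]) / 6.0
    return new


def split_with_halos(grid, parts):
    """Over-decompose the interior x-planes into `parts` blocks, each with one
    halo plane on either side (the copy stands in for a halo exchange)."""
    interior = len(grid) - 2
    assert interior % parts == 0, "interior planes must divide evenly"
    width = interior // parts
    blocks = []
    for p in range(parts):
        lo = 1 + p * width
        blocks.append([[row[:] for row in plane]
                       for plane in grid[lo - 1:lo + width + 1]])
    return blocks


def blocked_sweep(grid, parts):
    """Run one sweep block by block and stitch the interiors back together."""
    # Each block depends only on its own data and halos, so the blocks
    # could be scheduled as independent tasks.
    updated = [jacobi_sweep(block) for block in split_with_halos(grid, parts)]
    result = [[row[:] for row in grid[0]]]      # untouched x = 0 boundary
    for block in updated:
        result.extend(block[1:-1])              # drop the halo planes
    result.append([row[:] for row in grid[-1]])  # untouched last boundary
    return result


if __name__ == "__main__":
    grid = make_grid(6, 4, 4)
    same = blocked_sweep(grid, 2) == jacobi_sweep(grid)
    print("blocked sweep matches whole-grid sweep:", same)
```

Because each block reads only its own planes plus halos, the per-block sweeps are independent within an iteration. That independence is what lets a runtime schedule one block's computation while another block's halo data is still in flight.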
