Paper Title

Comparing single-node and multi-node performance of an important fusion HPC code benchmark

Paper Authors

Belli, Emily A., Candy, Jeff, Sfiligoi, Igor, Würthwein, Frank

Paper Abstract

Fusion simulations have traditionally required the use of leadership-scale High Performance Computing (HPC) resources in order to produce advances in physics. The impressive improvements in compute and memory capacity of many-GPU compute nodes now allow some problems that once required a multi-node setup to also be solvable on a single node. When possible, the increased interconnect bandwidth can result in an order of magnitude higher science throughput, especially for communication-heavy applications. In this paper we analyze the performance of the fusion simulation tool CGYRO, an Eulerian gyrokinetic turbulence solver designed and optimized for collisional, electromagnetic, multiscale simulation, which is widely used in the fusion research community. Due to the nature of the problem, the application has to work on a large multi-dimensional computational mesh as a whole, requiring frequent exchange of large amounts of data between the compute processes. In particular, we show that the average-scale nl03 benchmark CGYRO simulation can be run at an acceptable speed on a single Google Cloud instance with 16 A100 GPUs, outperforming 8 NERSC Perlmutter Phase1 nodes, 16 ORNL Summit nodes, and 256 NERSC Cori nodes. Moving from a multi-node to a single-node GPU setup, we obtain comparable simulation times using less than half the number of GPUs. Larger benchmark problems, however, still require a multi-node HPC setup due to GPU memory capacity needs, since at the time of writing no vendor offers nodes with a sufficiently large GPU memory configuration. The upcoming external NVSwitch does, however, promise to deliver an almost equivalent solution for up to 256 NVIDIA GPUs.
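The communication pattern the abstract alludes to is the central scaling bottleneck: a distributed Eulerian solver must repeatedly transpose its multi-dimensional mesh across all compute processes, which maps onto collective all-to-all exchanges. The following is a minimal MPI sketch, not CGYRO source code, illustrating why such an exchange stresses interconnect bandwidth; the per-peer buffer size "block" is a hypothetical stand-in for the mesh slab each process pair actually trades.

/* Minimal sketch of an all-to-all mesh transpose step (hypothetical
 * sizes, not CGYRO source).  Compile with an MPI compiler, e.g.:
 *   mpicc -O2 alltoall_sketch.c -o alltoall_sketch && mpirun -n 4 ./alltoall_sketch
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Hypothetical per-pair block: 2^20 doubles (8 MiB) sent to each peer. */
    const int block = 1 << 20;

    double *sendbuf = malloc((size_t)nprocs * block * sizeof *sendbuf);
    double *recvbuf = malloc((size_t)nprocs * block * sizeof *recvbuf);
    for (size_t i = 0; i < (size_t)nprocs * block; i++)
        sendbuf[i] = (double)rank;

    /* Every rank exchanges a block with every other rank, so aggregate
     * traffic per step grows with the square of the process count; this
     * is why interconnect bandwidth, not raw FLOPS, often sets the
     * simulation rate for communication-heavy solvers. */
    double t0 = MPI_Wtime();
    MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                 recvbuf, block, MPI_DOUBLE, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("all-to-all of %d x %d doubles took %.3f s\n",
               nprocs, block, t1 - t0);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Because every rank sends data to every other rank at each step, keeping all 16 GPUs behind a single node's NVLink/NVSwitch fabric, as in the Google Cloud instance discussed above, lets these exchanges bypass the slower inter-node network entirely, which is consistent with the single-node speedup the paper reports.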
