The Specialized High-Performance Network on Anton 3

Paper title

The Specialized High-Performance Network on Anton 3

Authors

Shim, Keun Sup; Greskamp, Brian; Towles, Brian; Edwards, Bruce; Grossman, J. P.; Shaw, David E.

Abstract

Molecular dynamics (MD) simulation, a computationally intensive method that provides invaluable insights into the behavior of biomolecules, typically requires large-scale parallelization. Implementation of fast parallel MD simulation demands both high bandwidth and low latency for inter-node communication, but in current semiconductor technology, neither of these properties is scaling as quickly as intra-node computational capacity. This disparity in scaling necessitates architectural innovations to maximize the utilization of computational units. For Anton 3, the latest in a family of highly successful special-purpose supercomputers designed for MD simulations, we thus designed and built a completely new specialized network as part of our ASIC. Tightly integrating this network with specialized computation pipelines enables Anton 3 to perform simulations orders of magnitude faster than any general-purpose supercomputer, and to outperform its predecessor, Anton 2 (the state of the art prior to Anton 3), by an order of magnitude. In this paper, we present the three key features of the network that contribute to the high performance of Anton 3. First, through architectural optimizations, the network achieves very low end-to-end inter-node communication latency for fine-grained messages, allowing for better overlap of computation and communication. Second, novel application-specific compression techniques reduce the size of most messages sent between nodes, thereby increasing effective inter-node bandwidth. Lastly, a new hardware synchronization primitive, called a network fence, supports fast fine-grained synchronization tailored to the data flow within a parallel MD application. These application-driven specializations to the network are critical for Anton 3's MD simulation performance advantage over all other machines.
