PsPIN: A high-performance low-power architecture for flexible in-network compute

Paper title

PsPIN: A high-performance low-power architecture for flexible in-network compute

Authors

Salvatore Di Girolamo, Andreas Kurth, Alexandru Calotoiu, Thomas Benz, Timo Schneider, Jakub Beránek, Luca Benini, Torsten Hoefler

Abstract

The capacity of offloading data and control tasks to the network is becoming increasingly important, especially if we consider the faster growth of network speed when compared to CPU frequencies. In-network compute alleviates the host CPU load by running tasks directly in the network, enabling additional computation/communication overlap and potentially improving overall application performance. However, sustaining bandwidths provided by next-generation networks, e.g., 400 Gbit/s, can become a challenge. sPIN is a programming model for in-NIC compute, where users specify handler functions that are executed on the NIC, for each incoming packet belonging to a given message or flow. It enables a CUDA-like acceleration, where the NIC is equipped with lightweight processing elements that process network packets in parallel. We investigate the architectural specialties that a sPIN NIC should provide to enable high-performance, low-power, and flexible packet processing. We introduce PsPIN, a first open-source sPIN implementation, based on a multi-cluster RISC-V architecture and designed according to the identified architectural specialties. We investigate the performance of PsPIN with cycle-accurate simulations, showing that it can process packets at 400 Gbit/s for several use cases, introducing minimal latencies (26 ns for 64 B packets) and occupying a total area of 18.5 mm² (22 nm FDSOI).
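To put the abstract's figures in perspective, the following back-of-the-envelope sketch estimates the packet arrival rate implied by a 400 Gbit/s link carrying 64 B packets, and how many packets would be in flight at once if each one spends 26 ns in the NIC. It is an illustration only, not taken from the paper: it assumes back-to-back packets with no framing overhead, it reads the 26 ns figure as a per-packet residence time, and the function name `packet_budget` is hypothetical.

```python
def packet_budget(link_gbps: float, packet_bytes: int, latency_ns: float):
    """Estimate per-packet timing for a fully loaded link.

    Assumptions (not from the paper): packets arrive back to back,
    framing overhead is ignored, and latency_ns is the time each
    packet spends being processed.
    """
    packet_bits = packet_bytes * 8
    # 1 Gbit/s equals 1 bit per nanosecond, so this is in nanoseconds.
    interarrival_ns = packet_bits / link_gbps
    # Packets per second, expressed in millions (Mpps).
    rate_mpps = 1e3 / interarrival_ns
    # Little's law: packets in flight = arrival rate * residence time.
    in_flight = latency_ns / interarrival_ns
    return interarrival_ns, rate_mpps, in_flight


if __name__ == "__main__":
    gap, rate, concurrent = packet_budget(400, 64, 26)
    print(f"inter-arrival gap: {gap:.2f} ns")
    print(f"packet rate: {rate:.2f} Mpps")
    print(f"packets in flight: {concurrent:.1f}")
```

Under these assumptions a new 64 B packet arrives roughly every 1.28 ns, so about 20 packets would need to be handled concurrently to keep up with line rate, which is consistent with the abstract's emphasis on many lightweight processing elements working in parallel.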
