Paper Title

Scale-out Systolic Arrays

Paper Authors

Ahmet Caner Yüzügüler, Canberk Sönmez, Mario Drumond, Yunho Oh, Babak Falsafi, Pascal Frossard

Paper Abstract

Multi-pod systolic arrays are emerging as the architecture of choice in DNN inference accelerators. Despite their potential, designing multi-pod systolic arrays to maximize effective throughput/Watt (i.e., throughput/Watt adjusted for array utilization) poses a unique set of challenges. In this work, we study three key pillars in multi-pod systolic array designs, namely array granularity, interconnect, and tiling. We identify optimal array granularity across workloads and show that state-of-the-art commercial accelerators use suboptimal array sizes for single-tenancy workloads. We then evaluate the bandwidth/latency trade-offs in interconnects and show that butterfly networks offer a scalable topology for accelerators with a large number of pods. Finally, we introduce a novel data tiling scheme with custom partition size to maximize utilization in optimally sized pods. We propose Scale-out Systolic Arrays (SOSA), a multi-pod inference accelerator for both single- and multi-tenancy based on these three pillars. We show that SOSA scales to an effective throughput of up to 600 TeraOps/s for state-of-the-art DNN inference workloads and outperforms state-of-the-art multi-pod accelerators by a factor of 1.5x.
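For intuition, the effective throughput/Watt metric referenced in the abstract can be read as peak throughput discounted by array utilization. The sketch below is an illustrative restatement, not notation from the paper; the symbols U (utilization), T_peak (peak throughput), and P (power) are our own.

$$
T_{\text{eff}} = U \cdot T_{\text{peak}}, \qquad
\frac{T_{\text{eff}}}{\text{Watt}} = \frac{U \cdot T_{\text{peak}}}{P}
$$

Under this reading, a design with higher peak throughput can still lose on the effective metric if its array sizing or tiling leaves U low; for example, a hypothetical 100 TeraOps/s pod running at 50% utilization delivers only 50 TeraOps/s effective.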
