Paper Title


Systolic-CNN: An OpenCL-defined Scalable Run-time-flexible FPGA Accelerator Architecture for Accelerating Convolutional Neural Network Inference in Cloud/Edge Computing

Authors

Akshay Dua, Yixing Li, Fengbo Ren

Abstract


This paper presents Systolic-CNN, an OpenCL-defined, scalable, run-time-flexible FPGA accelerator architecture optimized for accelerating the inference of various convolutional neural networks (CNNs) in multi-tenancy cloud/edge computing. Existing OpenCL-defined FPGA accelerators for CNN inference are insufficient due to limited flexibility for supporting multiple CNN models at run time and poor scalability, resulting in underutilized FPGA resources and limited computational parallelism. Systolic-CNN adopts a highly pipelined and parallelized 1-D systolic array architecture, which efficiently exploits both spatial and temporal parallelism for accelerating CNN inference on FPGAs. Systolic-CNN is highly scalable and parameterized, and can be easily adapted by users to achieve up to 100% utilization of the coarse-grained computation resources (i.e., DSP blocks) of a given FPGA. Systolic-CNN is also run-time-flexible in the context of multi-tenancy cloud/edge computing: it can be time-shared to accelerate a variety of CNN models at run time without recompiling the FPGA kernel hardware or reprogramming the FPGA. The experiment results based on an Intel Arria/Stratix 10 GX FPGA Development board show that the optimized single-precision implementation of Systolic-CNN can achieve an average inference latency of 7ms/2ms, 84ms/33ms, 202ms/73ms, 1615ms/873ms, and 900ms/498ms per image for accelerating AlexNet, ResNet-50, ResNet-152, RetinaNet, and Light-weight RetinaNet, respectively. Codes are available at https://github.com/PSCLab-ASU/Systolic-CNN.
