Paper Title


NeuroMAX: A High Throughput, Multi-Threaded, Log-Based Accelerator for Convolutional Neural Networks

Authors

Mahmood Azhar Qureshi, Arslan Munir

Abstract


Convolutional neural networks (CNNs) require high-throughput hardware accelerators for real-time applications owing to their huge computational cost. Most traditional CNN accelerators rely on single-core, linear processing elements (PEs) in conjunction with 1D dataflows for accelerating convolution operations. This limits the maximum achievable ratio of peak throughput per PE count to unity. Most of the past works optimize their dataflows to attain close to 100% hardware utilization to reach this ratio. In this paper, we introduce a high-throughput, multi-threaded, log-based PE core. The designed core provides a 200% increase in peak throughput per PE count while incurring only a 6% increase in area overhead compared to a single, linear multiplier PE core with the same output bit precision. We also present a 2D weight-broadcast dataflow which exploits the multi-threaded nature of the PE cores to achieve a high per-layer hardware utilization for various CNNs. The entire architecture, which we refer to as NeuroMAX, is implemented on a Xilinx Zynq 7020 SoC at a 200 MHz processing clock. Detailed analysis is performed on throughput, hardware utilization, area and power breakdown, and latency to show the performance improvement compared to previous FPGA and ASIC designs.
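The abstract does not specify NeuroMAX's exact log-domain arithmetic, but "log-based" PE cores in the accelerator literature commonly replace multipliers with adders in the logarithmic domain. A minimal sketch of one such standard scheme, Mitchell's approximation (a hypothetical stand-in here, not necessarily the paper's method): the integer log2 is taken from the leading-one position, the fractional part is approximated linearly, and a multiply becomes an addition of the two log values followed by a shift-based antilog.

```python
def mitchell_log2(x: int) -> float:
    """Mitchell's approximation: log2(x) ~= k + (x / 2^k - 1),
    where k is the position of the leading one (floor(log2 x))."""
    assert x > 0
    k = x.bit_length() - 1          # leading-one detector
    return k + (x - (1 << k)) / (1 << k)

def log_mul(a: int, b: int) -> float:
    """Approximate a * b by adding log approximations, then
    converting back with the inverse linear approximation."""
    s = mitchell_log2(a) + mitchell_log2(b)
    k = int(s)                      # integer part -> shift amount
    frac = s - k                    # fractional part -> linear antilog
    return (1 + frac) * (1 << k)
```

For power-of-two operands the result is exact (e.g. `log_mul(8, 16)` gives 128); otherwise Mitchell's scheme underestimates the product by at most about 11%, which is the usual accuracy/area trade-off that makes log-based PEs cheap enough to multi-thread.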
