Paper Title
BinArray: A Scalable Hardware Accelerator for Binary Approximated CNNs
Paper Authors
Paper Abstract
Deep Convolutional Neural Networks (CNNs) have become the state of the art for computer vision and other signal processing tasks due to their superior accuracy. In recent years, large efforts have been made to reduce the computational cost of CNNs in order to achieve real-time operation on low-power embedded devices. Towards this goal we present BinArray, a custom hardware accelerator for CNNs with binary approximated weights. The binary approximation used in this paper is an improved version of the network compression technique initially suggested in [1]. It drastically reduces the number of multiplications required per inference with little or no accuracy degradation. BinArray scales easily and allows trading off hardware resource usage against throughput by means of three design parameters that are transparent to the user. Furthermore, it is possible to select between high accuracy and high throughput dynamically at runtime. BinArray has been optimized at the register transfer level and operates at 400 MHz as an instruction-set processor within a heterogeneous XC7Z045-2 FPGA-SoC platform. Experimental results show that BinArray scales to match the performance of other accelerators, such as EdgeTPU [2], for different network sizes. Even for the largest MobileNet, only 50% of the target device and only 96 DSP blocks are utilized.
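To illustrate the general idea of binary weight approximation mentioned in the abstract, the following is a minimal sketch of one common greedy scheme: a real-valued weight tensor W is approximated as a sum of M scaled binary ({-1, +1}) tensors, so that most multiplications collapse into sign-controlled additions plus M scalings. This is an assumption-laden illustration of the family of techniques referenced via [1], not the paper's exact algorithm; the function name `binary_approximate` and the greedy residual scheme are hypothetical choices for exposition.

```python
import numpy as np

def binary_approximate(w, m=2):
    """Greedily approximate w as sum_j alpha_j * b_j with b_j in {-1, +1}.

    One common greedy scheme (illustrative; the paper's exact method
    may differ): at each step, take the sign of the current residual
    as the binary base and the mean absolute residual as its scale.
    """
    residual = np.asarray(w, dtype=np.float64).copy()
    alphas, bases = [], []
    for _ in range(m):
        b = np.where(residual >= 0, 1.0, -1.0)   # binary base in {-1, +1}
        alpha = np.mean(np.abs(residual))         # least-squares scale for this base
        alphas.append(alpha)
        bases.append(b)
        residual -= alpha * b                     # peel off this component
    approx = sum(a * b for a, b in zip(alphas, bases))
    return approx, alphas, bases
```

With such a decomposition, a dot product with W reduces to M sign-gated accumulations followed by M multiplications by the scalars alpha_j, which is the kind of multiplication reduction the abstract refers to; increasing M trades throughput for accuracy, matching the runtime accuracy/throughput selection described.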