Paper Title
FantastIC4: A Hardware-Software Co-Design Approach for Efficiently Running 4bit-Compact Multilayer Perceptrons
Paper Authors
Paper Abstract
With the growing demand for deploying deep learning models to the "edge", it is paramount to develop techniques that allow executing state-of-the-art models within very tight and limited resource constraints. In this work we propose a software-hardware optimization paradigm for obtaining a highly efficient execution engine for deep neural networks (DNNs) that are based on fully-connected layers. Our approach is centred around compression as a means of reducing the area and power requirements of, concretely, multilayer perceptrons (MLPs) with high predictive performance. Firstly, we design a novel hardware architecture named FantastIC4, which (1) supports the efficient on-chip execution of multiple compact representations of fully-connected layers and (2) minimizes the number of multipliers required for inference down to only 4 (thus the name). Moreover, in order to make the models amenable to efficient execution on FantastIC4, we introduce a novel entropy-constrained training method that renders them simultaneously robust to 4-bit quantization and highly compressible in size. The experimental results show that we can achieve a throughput of 2.45 TOPS with a total power consumption of 3.6W on a Virtex UltraScale FPGA XCVU440 device implementation, and a total power efficiency of 20.17 TOPS/W on a 22nm process ASIC version. When compared to other state-of-the-art accelerators designed for the Google Speech Command (GSC) dataset, FantastIC4 is better by 51$\times$ in terms of throughput and 145$\times$ in terms of area efficiency (GOPS/mm$^2$).
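Two ideas from the abstract are easy to illustrate in isolation. First, 4-bit weights take at most 16 distinct values, so a dot product can be reorganized to accumulate activations per weight value and multiply only once per value, which is the general reason so few multipliers suffice. Second, an entropy penalty on the distribution of weight-to-centroid assignments makes the quantized weights highly compressible by an entropy coder. The sketch below is a minimal illustration of both ideas only; the function names are hypothetical and it makes no assumptions about the actual FantastIC4 datapath or training procedure described in the paper.

```python
import numpy as np

def dot_product_accumulate_then_multiply(x, w_idx, centroids):
    """Compute <x, w> for 4-bit-quantized weights w = centroids[w_idx].

    x:         activations, shape (N,)
    w_idx:     4-bit centroid indices in [0, 16), shape (N,)
    centroids: the 16 quantization levels, shape (16,)
    """
    acc = np.zeros(len(centroids))
    for xi, k in zip(x, w_idx):
        acc[k] += xi                       # adder-only inner loop, no multiplies
    return float(np.dot(acc, centroids))   # at most 16 multiplications total

def assignment_entropy(w_idx, num_levels=16):
    """Empirical entropy (bits/weight) of the centroid-assignment distribution.
    Lower entropy means the quantized weights compress better."""
    p = np.bincount(w_idx, minlength=num_levels) / len(w_idx)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Example on a random 4-bit layer: the reorganized dot product matches the
# naive one while replacing N multiplications by at most 16.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
centroids = np.linspace(-1.0, 1.0, 16)
w_idx = rng.integers(0, 16, size=1024)

reference = float(np.dot(x, centroids[w_idx]))
fast = dot_product_accumulate_then_multiply(x, w_idx, centroids)
assert np.isclose(reference, fast)
print(f"assignment entropy: {assignment_entropy(w_idx):.2f} bits/weight")
```

How the architecture reduces the per-value multiplications further, down to a fixed budget of exactly 4 shared multipliers, is a hardware scheduling detail of the paper that this sketch does not attempt to reproduce.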