Paper Title

Memory-Efficient Dataflow Inference for Deep CNNs on FPGA

Authors

Lucian Petrica, Tobias Alonso, Mairin Kroes, Nicholas Fraser, Sorin Cotofana, Michaela Blott

Abstract

Custom dataflow Convolutional Neural Network (CNN) inference accelerators on FPGA are tailored to a specific CNN topology and store parameters in On-Chip Memory (OCM), resulting in high energy efficiency and low inference latency. However, in these accelerators the shapes of parameter memories are dictated by throughput constraints and do not map well to the underlying OCM, which becomes an implementation bottleneck. In this work, we propose an accelerator design methodology, Frequency Compensated Memory Packing (FCMP), which improves the OCM utilization efficiency of dataflow accelerators with minimal reduction in throughput and no modifications to the physical structure of FPGA OCM. To validate our methodology, we apply it to several realizations of medium-sized CIFAR-10 inference accelerators and demonstrate up to 30% reduction in OCM utilization without loss of inference throughput, allowing us to port the accelerators from the Xilinx Zynq 7020 to the smaller 7012S and thereby reduce application cost. We also implement a custom dataflow FPGA inference accelerator for a quantized ResNet-50 CNN with all weights stored on-chip, the largest topology implemented with this accelerator architecture to date. We demonstrate that applying FCMP to the ResNet accelerator alleviates the OCM bottleneck, enabling the accelerator to be ported from the Alveo U250 to the smaller Alveo U280 board with less throughput loss than alternative techniques. By providing a finer-grained trade-off between throughput and OCM requirements, FCMP increases the flexibility of custom dataflow CNN inference designs on FPGA.
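
As a back-of-the-envelope illustration of the OCM inefficiency FCMP targets, the Python sketch below uses a first-order model of Xilinx RAMB36 primitives (36 Kbit, fixed aspect ratios, ignoring tool-level composition) and a hypothetical weight-buffer shape. The buffer dimensions, and the assumption that packing exactly two buffers behind a memory clocked at 2x the compute frequency preserves per-buffer bandwidth, are illustrative assumptions, not figures taken from the paper.

```python
import math

# First-order model of a Xilinx RAMB36 primitive: 36 Kbit, selectable
# aspect ratios from 512x72 down to 32768x1.
RAMB36_BITS = 36 * 1024
ASPECTS = [(512, 72), (1024, 36), (2048, 18), (4096, 9),
           (8192, 4), (16384, 2), (32768, 1)]

def brams_needed(depth, width):
    """Fewest RAMB36s that realize a depth x width memory directly."""
    return min(math.ceil(width / w) * math.ceil(depth / d)
               for d, w in ASPECTS)

def efficiency(depth, width, brams):
    """Fraction of allocated BRAM bits actually holding parameters."""
    return depth * width / (brams * RAMB36_BITS)

# A typical dataflow weight buffer is wide but shallow, so it occupies a
# whole BRAM while filling only part of it (hypothetical shape).
depth, width = 200, 64
direct = brams_needed(depth, width)
print(f"one buffer:  {direct} RAMB36, "
      f"{efficiency(depth, width, direct):.0%} utilized")

# FCMP-style packing: two such logical buffers share one physical memory.
# Clocking that memory at 2x the compute frequency (double-pumping) gives
# each buffer one access per compute cycle, compensating the sharing.
packed = brams_needed(2 * depth, width)
print(f"packed pair: {packed} RAMB36 for both buffers, "
      f"{efficiency(2 * depth, width, packed):.0%} utilized")
```

In this toy case the two buffers drop from two half-empty RAMB36s (35% utilized each) to a single shared one at 69% utilization, which is the kind of fine-grained OCM saving, paid for with a faster memory clock rather than lost throughput, that the abstract describes.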
