Paper Title

LogicNets: Co-Designed Neural Networks and Circuits for Extreme-Throughput Applications

Paper Authors

Yaman Umuroglu, Yash Akhauri, Nicholas J. Fraser, Michaela Blott

Paper Abstract

Deployment of deep neural networks for applications that require very high throughput or extremely low latency is a severe computational challenge, further exacerbated by inefficiencies in mapping the computation to hardware. We present a novel method for designing neural network topologies that directly map to a highly efficient FPGA implementation. By exploiting the equivalence of artificial neurons with quantized inputs/outputs and truth tables, we can train quantized neural networks that can be directly converted to a netlist of truth tables, and subsequently deployed as a highly pipelinable, massively parallel FPGA circuit. However, the neural network topology requires careful consideration since the hardware cost of truth tables grows exponentially with neuron fan-in. To obtain smaller networks where the whole netlist can be placed-and-routed onto a single FPGA, we derive a fan-in driven hardware cost model to guide topology design, and combine high sparsity with low-bit activation quantization to limit the neuron fan-in. We evaluate our approach on two tasks with very high intrinsic throughput requirements in high-energy physics and network intrusion detection. We show that the combination of sparsity and low-bit activation quantization results in high-speed circuits with small logic depth and low LUT cost, demonstrating competitive accuracy with less than 15 ns of inference latency and throughput in the hundreds of millions of inferences per second.
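
The core idea in the abstract is that a neuron whose inputs and outputs are quantized to a few bits can be exhaustively enumerated into a truth table, and that the size of this table (and hence the LUT cost) grows exponentially with the neuron's fan-in times its input bit-width. The following is a minimal illustrative sketch of that enumeration, not the authors' implementation: the `quantize`/`dequantize` helpers, the uniform quantizer, and the example weights are all hypothetical.

```python
# Minimal sketch (not the LogicNets code) of enumerating a quantized neuron
# into a truth table. With fan-in F and b-bit inputs there are 2**(F*b)
# input patterns, which is what makes the hardware cost explode with fan-in
# and motivates combining sparsity with low-bit activation quantization.
import itertools
import numpy as np

def quantize(x, bits, x_min=-1.0, x_max=1.0):
    """Uniformly quantize x to an integer code in [0, 2**bits - 1] (hypothetical quantizer)."""
    levels = 2 ** bits
    step = (x_max - x_min) / (levels - 1)
    return int(np.round((np.clip(x, x_min, x_max) - x_min) / step))

def dequantize(code, bits, x_min=-1.0, x_max=1.0):
    """Map integer codes back to real values on the quantization grid."""
    step = (x_max - x_min) / (2 ** bits - 1)
    return x_min + np.asarray(code) * step

def neuron_truth_table(weights, bias, in_bits, out_bits):
    """Enumerate every quantized input pattern of a single neuron and record
    its quantized output -- the table that could then be mapped to FPGA LUTs."""
    fan_in = len(weights)
    table = {}
    for codes in itertools.product(range(2 ** in_bits), repeat=fan_in):
        x = dequantize(codes, in_bits)
        y = max(0.0, float(np.dot(weights, x) + bias))      # ReLU activation
        table[codes] = quantize(y, out_bits, 0.0, 1.0)       # quantized output code
    return table

# Fan-in 3 with 2-bit inputs gives only 2**(3*2) = 64 rows -- a tiny table.
# Fan-in 10 with the same 2-bit inputs would already need 2**20 rows,
# illustrating the exponential cost the abstract's fan-in model captures.
tt = neuron_truth_table(weights=np.array([0.5, -0.25, 0.75]), bias=0.1,
                        in_bits=2, out_bits=2)
print(len(tt))  # 64
```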
