Paper title
Fast matrix multiplication for binary and ternary CNNs on ARM CPU
Paper authors
Paper abstract
Low-bit quantized neural networks (QNNs) are of great interest in practical applications because they significantly reduce the consumption of both memory and computational resources. Binary neural networks (BNNs) are memory- and computationally efficient, as they require only one bit per weight and activation and can be computed using Boolean logic and bit-count operations. QNNs with ternary weights and activations (TNNs) and with binary weights and ternary activations (TBNs) aim to improve recognition quality compared to BNNs while preserving a low bit-width. However, their efficient implementation is usually considered on ASICs and FPGAs, which limits their applicability in real-life tasks. At the same time, one of the areas where efficient recognition is most in demand is recognition on mobile devices using their CPUs. Yet there are no known fast implementations of TBNs and TNNs, only the daBNN library for BNN inference. In this paper, we propose novel fast algorithms for ternary, ternary-binary, and binary matrix multiplication on mobile devices with the ARM architecture. In our algorithms, ternary weights are represented using a 2-bit encoding and binary weights using a single bit. This allows us to replace multiplication with Boolean logic operations that can be computed on 128 bits simultaneously using the ARM NEON SIMD extension. The matrix multiplication results are accumulated in 16-bit integer registers. We also use a special reordering of the values in the left and right matrices. All of this allows us to compute a matrix product efficiently while minimizing the number of loads and stores compared to the algorithm from daBNN. Our algorithms can be used to implement inference of the convolutional and fully connected layers of TNNs, TBNs, and BNNs. We evaluate them experimentally on an ARM Cortex-A73 CPU and compare their inference speed to efficient implementations of full-precision, 8-bit, and 4-bit quantized matrix multiplication.
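As a concrete illustration of the idea sketched in the abstract, the snippet below computes a ternary dot product with ARM NEON intrinsics: 128-bit Boolean AND plus per-byte popcount replaces multiplication, and partial sums are accumulated in 16-bit lanes. It is a minimal sketch under assumptions of our own, not the paper's actual kernel: we assume a two-bitmask encoding (`pos` marks +1 entries, `neg` marks -1 entries), the function name `ternary_dot_neon` is illustrative, the final reduction uses the AArch64-only `vaddlvq_u16`, and the special matrix reordering described in the paper is omitted.

```c
/* Sketch of a ternary dot product on AArch64 NEON (e.g., Cortex-A73).
 * Assumed encoding: each ternary vector is stored as two bitmasks of
 * n_bytes bytes each; a bit set in `pos` means +1, in `neg` means -1,
 * and a bit clear in both means 0. Not the authors' exact algorithm. */
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* n_bytes must be a multiple of 16 and small enough that the 16-bit
 * accumulator lanes cannot overflow (well below 4096 iterations). */
int32_t ternary_dot_neon(const uint8_t *a_pos, const uint8_t *a_neg,
                         const uint8_t *b_pos, const uint8_t *b_neg,
                         size_t n_bytes) {
    uint16x8_t acc_pp = vdupq_n_u16(0);  /* matches (+1, +1) */
    uint16x8_t acc_nn = vdupq_n_u16(0);  /* matches (-1, -1) */
    uint16x8_t acc_pn = vdupq_n_u16(0);  /* mismatches (+1, -1) */
    uint16x8_t acc_np = vdupq_n_u16(0);  /* mismatches (-1, +1) */
    for (size_t i = 0; i < n_bytes; i += 16) {
        uint8x16_t ap = vld1q_u8(a_pos + i), an = vld1q_u8(a_neg + i);
        uint8x16_t bp = vld1q_u8(b_pos + i), bn = vld1q_u8(b_neg + i);
        /* Boolean AND on 128 bits, per-byte popcount, then pairwise
         * accumulation into 16-bit integer lanes. */
        acc_pp = vpadalq_u8(acc_pp, vcntq_u8(vandq_u8(ap, bp)));
        acc_nn = vpadalq_u8(acc_nn, vcntq_u8(vandq_u8(an, bn)));
        acc_pn = vpadalq_u8(acc_pn, vcntq_u8(vandq_u8(ap, bn)));
        acc_np = vpadalq_u8(acc_np, vcntq_u8(vandq_u8(an, bp)));
    }
    /* dot = (#pairs with equal signs) - (#pairs with opposite signs) */
    return (int32_t)(vaddlvq_u16(acc_pp) + vaddlvq_u16(acc_nn))
         - (int32_t)(vaddlvq_u16(acc_pn) + vaddlvq_u16(acc_np));
}
```

The purely binary case is even simpler under the usual {-1, +1} one-bit encoding: the dot product of two n-bit vectors equals n - 2 * popcount(a XOR b), so XNOR/XOR plus popcount with the same 16-bit accumulation pattern suffices.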