Paper Title
High Throughput Matrix-Matrix Multiplication between Asymmetric Bit-Width Operands
Paper Authors
Paper Abstract
Matrix multiplications between asymmetric bit-width operands, especially between 8- and 4-bit operands, are likely to become a fundamental kernel of many important workloads, including neural networks and machine learning. While existing SIMD matrix multiplication instructions for symmetric bit-width operands can support operands of mixed precision by zero- or sign-extending the narrow operand to match the size of the other operand, they cannot exploit the benefit of the narrow bit-width of one of the operands. We propose a new SIMD matrix multiplication instruction that uses mixed precision on its inputs (8- and 4-bit operands) and accumulates product values into narrower 16-bit output accumulators, in turn allowing a SIMD operation at 128-bit vector width to process a greater number of data elements per instruction, improving processing throughput and memory bandwidth utilization without increasing the register read- and write-port bandwidth in CPUs. The proposed asymmetric-operand-size SIMD instruction offers a 2x improvement in matrix multiplication throughput in comparison to the throughput obtained using existing symmetric-operand-size instructions, while causing negligible (0.05%) overflow from the 16-bit accumulators for representative machine learning workloads. The asymmetric-operand-size instruction not only can improve matrix multiplication throughput in CPUs, but also can effectively support the multiply-and-accumulate (MAC) operation between 8- and 4-bit operands in state-of-the-art DNN hardware accelerators (e.g., the systolic array microarchitecture in the Google TPU) and offers a similar improvement in matrix multiplication performance without violating their various implementation constraints. We demonstrate how a systolic array architecture designed for symmetric-operand-size instructions could be modified to support an asymmetric-operand-size instruction.
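To make the core idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual ISA definition) of a single lane of such an asymmetric MAC: a signed 8-bit value is multiplied by a signed 4-bit value and accumulated into a signed 16-bit register with two's-complement wrap-around, which is what makes the rare accumulator overflow mentioned in the abstract possible. The function name and interface are illustrative assumptions.

```python
# Hypothetical model of one lane of an asymmetric-operand-size MAC:
# acc (int16) += a8 (int8) * b4 (int4), with two's-complement wrap to 16 bits.
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1

def mac_8x4_into_16(acc: int, a8: int, b4: int):
    """Return (new_acc, overflowed) for a 16-bit accumulate of an 8x4-bit product."""
    assert -128 <= a8 <= 127, "a8 must fit in signed 8 bits"
    assert -8 <= b4 <= 7, "b4 must fit in signed 4 bits"
    full = acc + a8 * b4                                   # exact product-and-sum
    wrapped = ((full + (1 << 15)) & 0xFFFF) - (1 << 15)    # wrap into signed 16 bits
    overflowed = not (INT16_MIN <= full <= INT16_MAX)
    return wrapped, overflowed

# A 128-bit vector register can hold 16 8-bit elements, 32 4-bit elements,
# or 8 16-bit accumulators, so narrowing operands raises lanes per instruction.
acc, ovf = mac_8x4_into_16(0, 127, 7)       # 127 * 7 = 889 fits in 16 bits
acc2, ovf2 = mac_8x4_into_16(32000, 127, 7) # 32889 exceeds INT16_MAX and wraps
```

The worst-case single product (127 x 7 = 889) is small relative to the 16-bit range, which is why long accumulation chains only rarely overflow for the quantized-value distributions typical of ML workloads.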