Paper Title
Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors
Paper Authors
Paper Abstract
Tensor Cores have been an important unit for accelerating fused Matrix Multiply-Accumulate (MMA) operations in all NVIDIA GPUs since the Volta architecture. To program Tensor Cores, users must use either the legacy wmma APIs or the current mma APIs. The legacy wmma APIs are easier to use but can exploit only a limited portion of the features and power of Tensor Cores. Specifically, the wmma APIs support fewer operand shapes and cannot leverage the new sparse matrix multiplication feature of the newest Ampere Tensor Cores. However, the performance of the current programming interfaces has not been well explored. Furthermore, the numeric behaviors of the low-precision floating-point formats (TF32, BF16, and FP16) supported by the newest Ampere Tensor Cores are also mysterious. In this paper, we explore the throughput and latency of the current programming APIs. We also intuitively study the numeric behaviors of Tensor Core MMA and profile the intermediate operations, including multiplication, inner-product addition, and accumulation. All code used in this work can be found at https://github.com/sunlex0717/DissectingTensorCores.
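As a concrete point of reference for the two programming interfaces mentioned in the abstract, the sketch below shows a minimal CUDA kernel that drives Tensor Cores through the legacy wmma API, with one warp computing a single 16x16x16 FP16 tile; the kernel name and the fixed tile size are illustrative choices, not taken from the paper. The current mma API instead exposes Tensor Cores at the granularity of the mma.sync PTX instruction, typically issued through inline assembly.

#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// Minimal sketch (illustrative, not from the paper): one warp computes a single
// 16x16x16 tile D = A*B + C with FP16 inputs and an FP32 accumulator,
// using the legacy wmma API.
__global__ void wmma_16x16x16_kernel(const half *a, const half *b, float *d) {
    // Per-warp fragments for A (row-major), B (column-major), and the accumulator.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                 // start from C = 0
    wmma::load_matrix_sync(a_frag, a, 16);               // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // MMA executed on Tensor Cores
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

Launching this kernel with a single warp, e.g. wmma_16x16x16_kernel<<<1, 32>>>(a, b, d), is enough to exercise the Tensor Core path; the mma-level interface is needed to reach the operand shapes and the sparse mode that, per the abstract, the wmma API does not expose.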