论文标题
通过压缩稀疏行格式的异质稀疏矩阵矢量乘法
Heterogeneous Sparse Matrix-Vector Multiplication via Compressed Sparse Row Format
论文作者
论文摘要
稀疏矩阵矢量乘法(SPMV)是高性能计算(HPC)中最重要的内核之一,但SPMV通常在许多设备上都患有不良性能。由于性能不佳,SPMV通常需要特殊护理来存储并调整给定设备。此外,HPC面临着包含多个不同计算单元的异质硬件,例如多核CPU和GPU。因此,一个新兴的目标是产生允许关键内核(例如SPMV)在具有便携性能和格式和方法最小变化的不同设备上执行关键内核(例如SPMV)的异质格式和方法。 This paper presents a heterogeneous format based on CSR, named CSR-k, that can be tuned quickly and outperforms the average performance of Intel MKL on Intel Xeon Platinum 8380 and AMD Epyc 7742 CPUs while still outperforming NVIDIA's cuSPARSE and Sandia National Laboratories' KokkosKernels on NVIDIA A100 and V100 for regular sparse矩阵,即稀疏的矩阵,其中每行的非齐率的数量具有差异$ \ leq $ 10,例如通常是从两个和三维有限差异和元素问题产生的矩阵。特别是,CSR-K通过重新排序并将行分组为超级排和超级行的层次结构来实现这一目标,这些结构仅由几个额外的指针表示。由于其简单性,可以为设备调整模型,并用于在恒定时间内选择超排和超级排的大小。
Sparse matrix-vector multiplication (SpMV) is one of the most important kernels in high-performance computing (HPC), yet SpMV normally suffers from ill performance on many devices. Due to ill performance, SpMV normally requires special care to store and tune for a given device. Moreover, HPC is facing heterogeneous hardware containing multiple different compute units, e.g., many-core CPUs and GPUs. Therefore, an emerging goal has been to produce heterogeneous formats and methods that allow critical kernels, e.g., SpMV, to be executed on different devices with portable performance and minimal changes to format and method. This paper presents a heterogeneous format based on CSR, named CSR-k, that can be tuned quickly and outperforms the average performance of Intel MKL on Intel Xeon Platinum 8380 and AMD Epyc 7742 CPUs while still outperforming NVIDIA's cuSPARSE and Sandia National Laboratories' KokkosKernels on NVIDIA A100 and V100 for regular sparse matrices, i.e., sparse matrices where the number of nonzeros per row has a variance $\leq$ 10, such as those commonly generated from two and three-dimensional finite difference and element problems. In particular, CSR-k achieves this with reordering and by grouping rows into a hierarchical structure of super-rows and super-super-rows that are represented by just a few extra arrays of pointers. Due to its simplicity, a model can be tuned for a device and used to select super-row and super-super-rows sizes in constant time.