论文标题
I/O-最佳算法,用于对称线性代数内核
I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels
论文作者
论文摘要
在本文中,我们考虑了线性代数中的两个基本对称内核:Cholesky分解和对称排名$ K $ Update(Syrk),以及这些内核的经典三个嵌套环算法。此外,我们考虑了一个机器模型,其大小$ s $的快速内存和无限的慢速内存。在此模型中,所有计算都必须在快速内存中的操作数上执行,目标是最大程度地减少慢速和快速记忆之间的通信量。 由于计算集是通过选择算法来确定的,因此只有计算(时间表)的顺序直接影响通信的数量。 n $对称的正定矩阵,以及$ \ frac {1} {\ sqrt {2}} \ frac {n^2m} {\ sqrt {s s}} $用于$ \ hat {a} \ cdot \ cdot \ cdot \ cdot \ cdot \ c. $ \ mathbf {a} $是$ n \ times m $矩阵。这两个界限都通过因子$ \ sqrt {2} $提高了文献中最著名的下限。此外,我们提出了两个具有匹配通信量的固定外,顺序算法:syrk的\ tbs,带有一卷的syrk $ \ frac {1} {\ sqrt {2}}} \ frac {n^2m} {\ sqrt {\ sqrt {s}}} + \ bigo {nm \ log n} $,\ lbc for Cholesky,\ lbc for Cholesky,具有体积$ \ frac {1} {3 \ sqrt {2}}} \ frac {n^3} {\ sqrt {s}}} + \ bigo {n^{5/2}}} $。两种算法都通过因子$ \ sqrt {2} $来改善文献中最著名的算法,并证明无法进一步提高我们的下限中的领先术语。这项工作表明,诸如Syrk或Cholesky之类的对称内核的操作强度在本质上(通过因子$ \ sqrt {2} $)高于相应的非对称核(GEMM和LU分解)。
In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-$k$ update (SYRK), with the classical three nested loops algorithms for these kernels. In addition, we consider a machine model with a fast memory of size $S$ and an unbounded slow memory. In this model, all computations must be performed on operands in fast memory, and the goal is to minimize the amount of communication between slow and fast memories. As the set of computations is fixed by the choice of the algorithm, only the ordering of the computations (the schedule) directly influences the volume of communications.We prove lower bounds of $\frac{1}{3\sqrt{2}}\frac{N^3}{\sqrt{S}}$ for the communication volume of the Cholesky factorization of an $N\times N$ symmetric positive definite matrix, and of $\frac{1}{\sqrt{2}}\frac{N^2M}{\sqrt{S}}$ for the SYRK computation of $\mat{A}\cdot\transpose{\mat{A}}$, where $\mathbf{A}$ is an $N\times M$ matrix. Both bounds improve the best known lower bounds from the literature by a factor $\sqrt{2}$.In addition, we present two out-of-core, sequential algorithms with matching communication volume: \TBS for SYRK, with a volume of $\frac{1}{\sqrt{2}}\frac{N^2M}{\sqrt{S}} + \bigo{NM\log N}$, and \LBC for Cholesky, with a volume of $\frac{1}{3\sqrt{2}}\frac{N^3}{\sqrt{S}} + \bigo{N^{5/2}}$. Both algorithms improve over the best known algorithms from the literature by a factor $\sqrt{2}$, and prove that the leading terms in our lower bounds cannot be improved further. This work shows that the operational intensity of symmetric kernels like SYRK or Cholesky is intrinsically higher (by a factor $\sqrt{2}$) than that of corresponding non-symmetric kernels (GEMM and LU factorization).