Enhancing data locality of the conjugate gradient method for high-order matrix-free finite-element implementations

Paper title

Enhancing data locality of the conjugate gradient method for high-order matrix-free finite-element implementations

Authors

Kronbichler, Martin; Sashko, Dmytro; Munch, Peter

Abstract

This work investigates a variant of the conjugate gradient (CG) method and embeds it into the context of high-order finite-element schemes with fast matrix-free operator evaluation and cheap preconditioners like the matrix diagonal. Relying on a data-dependency analysis and appropriate enumeration of degrees of freedom, we interleave the vector updates and inner products in a CG iteration with the matrix-vector product with only minor organizational overhead. As a result, around 90% of the vector entries of the three active vectors of the CG method are transferred from slow RAM memory exactly once per iteration, with all additional access hitting fast cache memory. Node-level performance analyses and scaling studies on up to 147k cores show that the CG method with the proposed performance optimizations is around two times faster than a standard CG solver as well as optimized pipelined CG and s-step CG methods for large sizes that exceed processor caches, and provides similar performance near the strong scaling limit.
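For orientation, the baseline that the abstract compares against can be sketched as a textbook conjugate gradient loop with a diagonal (Jacobi) preconditioner and a matrix-free operator. This is only a minimal illustration, not the authors' interleaved variant: the 1D Laplacian stencil and the names `apply_operator`, `dot`, and `jacobi_pcg` are assumptions made for this example. Each vector update and inner product below is a separate sweep over `x`, `r`, and `p`, the three active vectors whose memory traffic the proposed method reduces by merging these sweeps into the operator evaluation.

```python
def apply_operator(x):
    """Matrix-free 1D Laplacian stencil (-1, 2, -1) with zero boundary values.

    Hypothetical stand-in for a high-order finite-element operator: the
    matrix is never stored, only its action on a vector is computed.
    """
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y[i] = 2.0 * x[i] - left - right
    return y


def dot(a, b):
    """Inner product of two vectors stored as Python lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def jacobi_pcg(apply_op, inv_diag, b, tol=1e-10, max_iter=1000):
    """Standard preconditioned CG; inv_diag holds 1 / diag(A)."""
    x = [0.0] * len(b)
    r = list(b)                                    # r = b - A*x with x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]     # apply the preconditioner
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for it in range(1, max_iter + 1):
        q = apply_op(p)                            # matrix-vector product
        alpha = rz / dot(p, q)                     # separate pass over p, q
        x = [xi + alpha * pi for xi, pi in zip(x, p)]   # separate pass over x
        r = [ri - alpha * qi for ri, qi in zip(r, q)]   # separate pass over r
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]    # separate pass over p
        rz = rz_new
    return x, max_iter
```

A call such as `jacobi_pcg(apply_operator, [0.5] * n, b)` solves the system for a right-hand side `b` of length `n`; the value `0.5` is the inverse of the stencil's diagonal entry 2.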

Paper title