平行QR块低级矩阵的QR分解

论文标题

平行QR块低级矩阵的QR分解

Parallel QR Factorization of Block Low-Rank Matrices

论文作者

Apriansyah, M. Ridwan, Yokota, Rio

论文摘要

我们提出了两种新算法，用于块低级别（BLR）矩阵的居民QR分解：一种执行块柱QR的算法，另一种是基于瓷砖QR的。我们展示了Block-Olgumn算法如何利用BLR结构来实现$ \ MATHCAL {O}（Mn）$的算术复杂性，而瓷砖Blr-QR展示了$ \ Mathcal {O}（Mn^{1.5}）$复杂度。但是，瓷砖BLR-QR具有更精细的任务粒度，可以在共享内存系统上进行并行基于任务的执行。我们使用基于任务的并行性使用叉-Join并行性与瓷砖BLR-QR比较块柱的BLR-QR。我们还使用Fork-Join并行性将这两种居民BLR-QR的实现与块柱修改的革兰氏schmidt（MGS）BLR-QR进行了比较，以及在Intel MKL中使用的最先进的供应商供应商在Intel MKL中使用的最高供应商QR。对于131k $ \ times $ 65K的矩阵，所有BLR方法的数量级都比MKL中的密集QR快。与现有的基于MGS的方法相比，我们的方法对不良条件也具有强大的核能和更高的正交因素。在带有64个内核的CPU上，我们的平行瓷砖住户和块柱的住户算法分别显示了50次和37次的加速度。

We present two new algorithms for Householder QR factorization of Block Low-Rank (BLR) matrices: one that performs block-column-wise QR, and another that is based on tiled QR. We show how the block-column-wise algorithm exploits BLR structure to achieve arithmetic complexity of $\mathcal{O}(mn)$, while the tiled BLR-QR exhibits $\mathcal{O}(mn^{1.5})$ complexity. However, the tiled BLR-QR has finer task granularity that allows parallel task-based execution on shared memory systems. We compare the block-column-wise BLR-QR using fork-join parallelism with tiled BLR-QR using task-based parallelism. We also compare these two implementations of Householder BLR-QR with a block-column-wise Modified Gram-Schmidt (MGS) BLR-QR using fork-join parallelism, and a state-of-the-art vendor-optimized dense Householder QR in Intel MKL. For a matrix of size 131k $\times$ 65k, all BLR methods are more than an order of magnitude faster than the dense QR in MKL. Our methods are also robust to ill-conditioning and produce better orthogonal factors than the existing MGS-based method. On a CPU with 64 cores, our parallel tiled Householder and block-column-wise Householder algorithms show a speedup of 50 and 37 times, respectively.

下载PDF全文

下载文献需遵守相关版权规定

论文标题