Paper Title
Distributed-Memory Randomized Algorithms for Sparse Tensor CP Decomposition
Paper Authors
Paper Abstract
The CANDECOMP/PARAFAC (CP) decomposition, a generalization of the matrix singular value decomposition to higher-dimensional tensors, is a popular tool for analyzing multidimensional sparse data. On tensors with billions of nonzero entries, computing a CP decomposition is a computationally intensive task. We propose the first distributed-memory implementations of two randomized CP decomposition algorithms, CP-ARLS-LEV and STS-CP, that offer nearly an order-of-magnitude speedup at high decomposition ranks over well-tuned non-randomized decomposition packages. Both algorithms rely on leverage score sampling and enjoy strong theoretical guarantees, each offering a different tradeoff between running time and accuracy. We tailor the communication schedule for our random sampling algorithms, eliminating expensive reduction collectives and forcing communication costs to scale with the random sample count. Finally, we optimize the local storage format for our methods, switching between analogues of the compressed sparse column and compressed sparse row formats. Experiments show that our methods are fast and scalable, producing an 11x speedup over SPLATT by decomposing the billion-scale Reddit tensor on 512 CPU cores in under two minutes.
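For context, the rank-R CP model the abstract refers to approximates an N-mode tensor by a sum of outer products of factor-matrix columns; in standard notation (ours, not quoted from the paper),

$$\mathcal{X} \;\approx\; \sum_{r=1}^{R} \lambda_r \, \mathbf{a}^{(1)}_r \circ \mathbf{a}^{(2)}_r \circ \cdots \circ \mathbf{a}^{(N)}_r,$$

where \mathbf{a}^{(n)}_r is the r-th column of the mode-n factor matrix and \circ denotes the vector outer product. Roughly speaking, the leverage score sampling used by CP-ARLS-LEV and STS-CP draws rows of the Khatri-Rao product design matrix in each alternating least-squares solve with probability proportional to their leverage scores, so each solve touches only a small random subset of rows rather than the full matricized tensor.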