Paper Title

Sparse Tucker Tensor Decomposition on a Hybrid FPGA-CPU Platform

Authors

Weiyun Jiang, Kaiqi Zhang, Colin Yu Lin, Feng Xing, Zheng Zhang

Abstract

Recommendation systems, social network analysis, medical imaging, and data mining often involve processing sparse high-dimensional data. Such high-dimensional data are naturally represented as tensors and cannot be efficiently processed by conventional matrix or vector computations. Sparse Tucker decomposition is an important algorithm for compressing and analyzing these sparse high-dimensional data sets. When energy efficiency and data privacy are major concerns, hardware accelerators on resource-constrained platforms become crucial for the deployment of tensor algorithms. In this work, we propose a hybrid computing framework containing a CPU and an FPGA to accelerate sparse Tucker decomposition. The algorithm has three main modules: tensor-times-matrix (TTM), Kronecker products, and QR decomposition with column pivoting (QRP). We accelerate the first two modules on a Xilinx FPGA and run the third on a CPU. Our hybrid platform achieves $23.6\times$ to $1091\times$ speedup and $93.519\%$ to $99.514\%$ energy savings compared with a CPU baseline on synthetic and real-world datasets.
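To make the three modules named in the abstract concrete, here is a minimal dense sketch of each using NumPy/SciPy. This is an illustration of the underlying operations only, not the paper's sparse FPGA implementation; all shapes and data are toy assumptions.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)

# Module 1 -- tensor-times-matrix (TTM): contract a 3-way tensor with a
# factor matrix along mode 0, shrinking that mode to the Tucker rank.
X = rng.standard_normal((4, 5, 6))   # toy 3-way tensor
U = rng.standard_normal((2, 4))      # factor matrix, rank 2 on mode 0
Y = np.einsum('ri,ijk->rjk', U, X)   # TTM along mode 0 -> shape (2, 5, 6)

# Module 2 -- Kronecker product of factor matrices, which combines the
# factors of several modes when the tensor is unfolded into a matrix.
B = rng.standard_normal((2, 5))
C = rng.standard_normal((3, 6))
K = np.kron(B, C)                    # shape (2*3, 5*6) = (6, 30)

# Module 3 -- QR decomposition with column pivoting (QRP), the step the
# paper keeps on the CPU; it satisfies A[:, piv] = Q @ R.
A = rng.standard_normal((6, 4))
Q, R, piv = qr(A, pivoting=True)
```

The split mirrors the paper's design point: TTM and Kronecker products are regular, streaming-friendly kernels that map well to FPGA logic, while pivoted QR's data-dependent column selection is easier to run on the CPU.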
