Paper title
A GPU-Accelerated Barycentric Lagrange Treecode
Authors
Abstract
We present an MPI + OpenACC implementation of the kernel-independent barycentric Lagrange treecode (BLTC) for fast summation of particle interactions on GPUs. The distributed memory parallelization uses recursive coordinate bisection for domain decomposition and MPI remote memory access to build locally essential trees on each rank. The particle interactions are organized into target batch/source cluster interactions which efficiently map onto the GPU; target batching provides an outer level of parallelism, while the direct sum form of the barycentric particle-cluster approximation provides an inner level of parallelism. The GPU-accelerated BLTC performance is demonstrated on several test cases up to 1 billion particles interacting via the Coulomb potential and Yukawa potential.
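
To make the phrase "direct sum form of the barycentric particle-cluster approximation" concrete, the following is a minimal serial sketch in plain Python, not the paper's MPI + OpenACC code. It assumes Chebyshev points of the second kind as interpolation nodes, a single cubic source cluster, and one well-separated target; the function names (`cheb_points`, `modified_charges`, `direct_sum`, and so on) and the Yukawa parameter `kappa=0.5` are hypothetical choices made for illustration. Source charges are interpolated onto the Chebyshev grid once, and the far-field potential is then an ordinary direct sum over grid points, which is why the same kernel routine serves both the exact and the approximate evaluation.

```python
import math
import random

def cheb_points(n, a, b):
    """n+1 Chebyshev points of the second kind mapped to [a, b] (n >= 1)."""
    return [0.5 * (a + b) + 0.5 * (b - a) * math.cos(math.pi * k / n)
            for k in range(n + 1)]

def bary_weights(n):
    """Barycentric weights for Chebyshev points of the second kind."""
    return [(-1) ** k * (0.5 if k in (0, n) else 1.0) for k in range(n + 1)]

def lagrange_basis(y, pts, wts):
    """Values L_0(y), ..., L_n(y) computed in barycentric form."""
    for k, s in enumerate(pts):
        if y == s:  # y sits exactly on a node: basis is a unit vector
            return [1.0 if j == k else 0.0 for j in range(len(pts))]
    terms = [w / (y - s) for s, w in zip(pts, wts)]
    total = sum(terms)
    return [t / total for t in terms]

def modified_charges(sources, charges, n, box):
    """Interpolate source charges onto the (n+1)^3 Chebyshev grid of `box`.

    `box` is [(lo, hi)] * 3. Returns (proxy_points, modified_charges).
    """
    grids = [cheb_points(n, lo, hi) for lo, hi in box]
    wts = bary_weights(n)
    m = n + 1
    qhat = [0.0] * (m ** 3)
    for (y1, y2, y3), q in zip(sources, charges):
        l1 = lagrange_basis(y1, grids[0], wts)
        l2 = lagrange_basis(y2, grids[1], wts)
        l3 = lagrange_basis(y3, grids[2], wts)
        for i in range(m):
            for j in range(m):
                lij = l1[i] * l2[j] * q
                for k in range(m):
                    qhat[(i * m + j) * m + k] += lij * l3[k]
    proxies = [(grids[0][i], grids[1][j], grids[2][k])
               for i in range(m) for j in range(m) for k in range(m)]
    return proxies, qhat

def coulomb(x, y):
    return 1.0 / math.dist(x, y)

def yukawa(x, y, kappa=0.5):  # kappa is an illustrative value
    r = math.dist(x, y)
    return math.exp(-kappa * r) / r

def direct_sum(target, points, weights, kernel):
    """Potential at `target`; used for both exact and approximate sums."""
    return sum(kernel(target, p) * w for p, w in zip(points, weights))

# Small demonstration: one source cluster in the unit cube, one far target.
rng = random.Random(0)
sources = [(rng.random(), rng.random(), rng.random()) for _ in range(200)]
charges = [rng.random() for _ in range(200)]
target = (5.0, 5.0, 5.0)
proxies, qhat = modified_charges(sources, charges, 8, [(0.0, 1.0)] * 3)
for name, kernel in (("Coulomb", coulomb), ("Yukawa", yukawa)):
    exact = direct_sum(target, sources, charges, kernel)
    approx = direct_sum(target, proxies, qhat, kernel)
    print(name, "relative error:", abs(approx - exact) / abs(exact))
```

The kernel enters only through point evaluations, which is the sense in which the method is kernel-independent. In a GPU setting, the loop over targets in a batch would be the outer parallel level and the sum over proxy points inside `direct_sum` the inner one; this sketch evaluates a single target serially.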