High-Performance High-Order Stencil Computation on FPGAs Using OpenCL

Paper title

High-Performance High-Order Stencil Computation on FPGAs Using OpenCL

Authors

Hamid Reza Zohouri, Artur Podobas, Satoshi Matsuoka

Abstract

In this paper we evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis. We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, our design technique with combined spatial and temporal blocking remains effective. This allows us to reach similar, or even higher, compute performance compared to first-order stencils. We use an OpenCL-based design that, apart from parameterizing performance knobs, also parameterizes the stencil radius. Furthermore, we show that our performance model exhibits the same accuracy as first-order stencils in predicting the performance of high-order ones. On an Intel Arria 10 GX 1150 device, for 2D and 3D star-shaped stencils, we achieve over 700 and 270 GFLOP/s of compute performance, respectively, up to a stencil radius of four. These results outperform the state-of-the-art YASK framework on a modern Xeon for 2D and 3D stencils, and outperform a modern Xeon Phi for 2D stencils, while achieving competitive performance in 3D. Furthermore, our FPGA design achieves better power efficiency in almost all cases.
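For readers unfamiliar with the terminology, the following is a minimal sketch of what a star-shaped stencil with a parameterized radius computes. It is not the authors' OpenCL design and shows neither spatial nor temporal blocking. The function names `star_stencil_2d` and `points_in_star`, the symmetric per-distance weights, and the choice to leave boundary cells unchanged are all assumptions made for illustration.

```python
def star_stencil_2d(grid, radius, center_w, neighbor_w):
    """Apply one time step of a 2D star-shaped stencil of the given radius.

    grid: list of equal-length rows of floats.
    neighbor_w: list of `radius` weights; neighbor_w[d - 1] is applied to the
    four cells at distance d along the x and y axes (assumed symmetric).
    Cells closer than `radius` to the boundary are copied unchanged.
    """
    if len(neighbor_w) != radius:
        raise ValueError("neighbor_w must have exactly `radius` entries")
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # copy so boundary cells keep their values
    for y in range(radius, rows - radius):
        for x in range(radius, cols - radius):
            acc = center_w * grid[y][x]
            # A higher radius reads more neighbors along each axis.
            for d in range(1, radius + 1):
                acc += neighbor_w[d - 1] * (
                    grid[y][x - d] + grid[y][x + d]
                    + grid[y - d][x] + grid[y + d][x]
                )
            out[y][x] = acc
    return out


def points_in_star(radius, dims):
    """Cells read per update: the center plus 2 * dims * radius neighbors."""
    return 2 * dims * radius + 1
```

Under these assumptions, a radius-four star reads 17 cells per update in 2D and 25 in 3D, which illustrates why the abstract describes high-order stencils as needing more computation and on-chip memory than first-order ones.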
