Paper Title
High Throughput Multidimensional Tridiagonal Systems Solvers on FPGAs
Paper Authors
Paper Abstract
We present a design space exploration for synthesizing optimized, high-throughput implementations of multiple multi-dimensional tridiagonal system solvers on FPGAs. Re-evaluating the characteristics of algorithms for the direct solution of tridiagonal systems, we develop a new tridiagonal solver library aimed at implementing high-performance computing applications on Xilinx FPGA hardware. Key new features of the library are (1) the unification of standard state-of-the-art techniques for implementing implicit numerical solvers with a number of novel high-gain optimizations such as vectorization and batching, motivated by multi-dimensional systems in real-world applications, (2) data-flow techniques that provide application-specific optimizations for both 2D and 3D problems, including integration of the explicit loops commonplace in real workloads, and (3) the development of an analytic model to explore the design space and obtain rapid performance estimates. The new library provides an order of magnitude better performance for solving large batches of systems compared to Xilinx's current tridiagonal solver library. Two representative applications are implemented using the new solver on a Xilinx Alveo U280 FPGA, demonstrating over 85% predictive model accuracy. These are compared with a current state-of-the-art GPU library for solving multi-dimensional tridiagonal systems on an Nvidia V100 GPU, analyzing time to solution, bandwidth, and energy consumption. Results show that the FPGA achieves competitive or better runtime performance than the V100 GPU for a range of multi-dimensional problems. Additionally, the significant energy savings offered by the FPGA implementations, over 30% for the most complex application, are quantified. We discuss the algorithmic trade-offs required to obtain good performance on FPGAs, giving insights into the feasibility and profitability of FPGA implementations.
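
As an illustration of what a batched direct tridiagonal solve involves, the following is a minimal Python sketch based on the Thomas algorithm, a standard direct method. The abstract does not name the algorithm used in the library, so the choice of method, the function name `solve_batch`, and the assumption that all systems in the batch share the same coefficients are illustrative only. The inner loops over the batch index show why batching helps: each system carries a dependency along its own length, but the systems are independent of one another, so work from different systems can be interleaved.

```python
def solve_batch(a, b, c, rhs_batch):
    """Solve a batch of tridiagonal systems with shared coefficients.

    Illustrative sketch only (Thomas algorithm, no pivoting); it assumes
    each system is diagonally dominant so that no division by zero occurs.

    a: sub-diagonal (a[0] is unused)
    b: main diagonal
    c: super-diagonal (c[-1] is unused)
    rhs_batch: list of right-hand-side vectors, one per system
    """
    n = len(b)
    m = len(rhs_batch)

    # The coefficients are shared, so the elimination factors are
    # computed once and reused for every system in the batch.
    denom = [0.0] * n
    cp = [0.0] * n
    denom[0] = b[0]
    cp[0] = c[0] / denom[0]
    for i in range(1, n):
        denom[i] = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom[i]

    # Forward elimination. The dependency runs along i, while the
    # systems (index s) are independent and can be interleaved.
    dp = [[0.0] * n for _ in range(m)]
    for s in range(m):
        dp[s][0] = rhs_batch[s][0] / denom[0]
    for i in range(1, n):
        for s in range(m):
            dp[s][i] = (rhs_batch[s][i] - a[i] * dp[s][i - 1]) / denom[i]

    # Back substitution, again interleaved across the batch.
    x = [[0.0] * n for _ in range(m)]
    for s in range(m):
        x[s][n - 1] = dp[s][n - 1]
    for i in range(n - 2, -1, -1):
        for s in range(m):
            x[s][i] = dp[s][i] - cp[i] * x[s][i + 1]
    return x
```

For example, calling `solve_batch` with `a = [0, -1, -1, -1]`, `b = [2, 2, 2, 2]`, `c = [-1, -1, -1, 0]` and two right-hand sides returns one solution vector per system. A hardware implementation would differ substantially in how it schedules and stores this work; the sketch only shows the structure of the computation.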