Title
Accelerating Task-based Iterative Applications
Authors
Abstract
Task-based programming models have risen in popularity as an alternative to traditional fork-join parallelism. They are better suited to writing applications with irregular parallelism that can present load imbalance. However, these programming models suffer from overheads related to task creation, scheduling, and dependency management, limiting performance and scalability when tasks become too small. At the same time, many HPC applications implement iterative methods or multi-step simulations that create the same directed acyclic graph of tasks on each iteration. By giving application programmers a way to express that a specific loop creates the same task pattern on each iteration, we can create a single task DAG once and transform it into a cyclic graph. This cyclic graph is then reused for successive iterations, minimizing task creation and dependency management overhead. This paper presents the taskiter, a new construct we propose for the OmpSs-2 and OpenMP programming models, allowing the use of directed cyclic task graphs (DCTG) to minimize runtime overheads. Moreover, we present a simple immediate-successor locality-aware heuristic that minimizes task scheduling overhead by bypassing the runtime task scheduler. We evaluate the implementation of the taskiter and the immediate-successor heuristic in 8 iterative benchmarks. Using small task granularities, we obtain an average speedup of 3.7x over the reference OmpSs-2 implementation, and average speedups of 5x and 7.46x over the LLVM and GCC OpenMP runtimes, respectively.