Closing the Performance Gap with Modern C++

Paper title

Closing the Performance Gap with Modern C++

Authors

Thomas Heller, Hartmut Kaiser, Patrick Diehl, Dietmar Fey, Marc Alexander Schweitzer

Abstract

On the way to Exascale, programmers face the growing challenge of supporting multiple hardware architectures from the same code base. At the same time, portability of code and performance is increasingly difficult to achieve as hardware architectures become more diverse. Today's heterogeneous systems often include two or more completely distinct and incompatible hardware execution models, such as GPGPUs, SIMD vector units, and general-purpose cores, which conventionally have to be programmed using separate toolchains representing non-overlapping programming models. The recent revival of interest in the C++ language, in industry and in the wider community, has spurred a remarkable number of standardization proposals and technical specifications in the area of concurrency and parallelism. Recently this has included a growing discussion of the need for a uniform, higher-level abstraction and programming model for parallelism in the C++ standard, targeting heterogeneous and distributed computing. Such an abstraction should blend seamlessly with existing, already standardized language and library features, but should also be generic enough to support future hardware developments. In this paper, we present the results of developing such a higher-level programming abstraction for parallelism in C++, which aims to enable code and performance portability over a wide range of architectures and for various types of parallelism. We present performance data obtained from running the well-known STREAM benchmark ported to our higher-level C++ abstraction and compare it with the corresponding results from running the benchmark natively. We show that our abstractions enable performance at least as good as the comparable baseline benchmarks while providing a uniform programming API on all compared target architectures.
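To make the idea of a uniform API concrete, here is a minimal sketch of the STREAM triad kernel (`a[i] = b[i] + q * c[i]`, one of STREAM's four kernels) written against a policy parameter. It is not the paper's API: `sequential_policy`, `transform_with`, and `triad` are hypothetical names introduced only for illustration, and only a sequential policy is implemented. The assumption is that a parallel or accelerator policy could add its own `transform_with` overload while `triad` itself stays unchanged.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical policy tag (not from the paper): run on the calling thread.
struct sequential_policy {};

// Hypothetical dispatch point: apply `op` element-wise under a given policy.
// Other policies (threads, SIMD, GPU) would provide their own overloads.
template <typename InIt1, typename InIt2, typename OutIt, typename Op>
OutIt transform_with(sequential_policy, InIt1 first1, InIt1 last1,
                     InIt2 first2, OutIt out, Op op) {
    return std::transform(first1, last1, first2, out, op);
}

// STREAM triad: a[i] = b[i] + q * c[i].
// The kernel is written once; the policy argument selects how it executes.
template <typename Policy>
void triad(Policy policy, std::vector<double>& a,
           const std::vector<double>& b, const std::vector<double>& c,
           double q) {
    assert(a.size() == b.size() && b.size() == c.size());
    transform_with(policy, b.begin(), b.end(), c.begin(), a.begin(),
                   [q](double bi, double ci) { return bi + q * ci; });
}
```

Calling `triad(sequential_policy{}, a, b, c, q)` fills `a` from `b` and `c`. Under this design, switching to another execution model would mean passing a different policy object rather than rewriting the kernel.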
