Paper Title

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach

Paper Authors

Peng Zhang, Jianbin Fang, Canqun Yang, Chun Huang, Tao Tang, Zheng Wang

Abstract

This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model that requires expert insights into low-level hardware details, we employ machine learning techniques to automatically learn it. We achieve this by first learning a predictive model offline using training programs. The learnt model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. These results translate to over 93% of the performance delivered by a theoretically perfect predictor.
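To make the workflow concrete, the sketch below illustrates the two-phase scheme the abstract describes: learn a performance predictor offline from training programs, then use it at runtime to score candidate (resource partition, task granularity) configurations for an unseen program. This is not the authors' code; the feature layout, the random-forest regressor, the placeholder training data, and the candidate grid are all assumptions for illustration.

```python
# Minimal sketch of the offline-training / runtime-search workflow.
# All names, features, and data here are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# --- Offline phase: learn a performance predictor from training programs ---
# X_train: one row per (program, configuration) pair; columns hold static
# program features plus the configuration (resource partition, task granularity).
# y_train: measured speedup over the single-stream baseline.
rng = np.random.default_rng(0)
X_train = rng.random((500, 6))   # placeholder: 4 program features + 2 config values
y_train = rng.random(500)        # placeholder: measured speedups
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# --- Runtime phase: search for a good configuration for an unseen program ---
def best_configuration(program_features, partitions, granularities):
    """Score every candidate configuration with the learned model and
    return the one predicted to perform best."""
    candidates = [(p, g) for p in partitions for g in granularities]
    rows = np.array([list(program_features) + [p, g] for p, g in candidates])
    predicted = model.predict(rows)
    return candidates[int(np.argmax(predicted))]

# Example: 4 (hypothetical) program features, a small candidate grid.
config = best_configuration([0.2, 0.5, 0.1, 0.9],
                            partitions=[0.25, 0.5, 0.75],
                            granularities=[1, 2, 4, 8])
print("predicted-best (partition, granularity):", config)
```

Because the learned model only needs a cheap prediction per candidate rather than an actual run, exhaustively scoring a modest configuration grid like this is fast enough to do at runtime, which is what lets the approach approach the performance of a perfect predictor.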
