HEROv2: Full-Stack Open-Source Research Platform for Heterogeneous Computing

Paper Title

HEROv2: Full-Stack Open-Source Research Platform for Heterogeneous Computing

Authors

Andreas Kurth, Björn Forsberg, Luca Benini

Abstract

Heterogeneous computers integrate general-purpose host processors with domain-specific accelerators to combine versatility with efficiency and high performance. To realize the full potential of heterogeneous computers, however, many hardware and software design challenges have to be overcome. While architectural and system simulators can be used to analyze heterogeneous computers, they face unavoidable compromises between simulation speed and performance modeling accuracy. In this work we present HEROv2, an FPGA-based research platform that enables accurate and fast exploration of heterogeneous computers consisting of accelerators based on clusters of 32-bit RISC-V cores and an application-class 64-bit ARMv8 or RV64 host processor. HEROv2 allows data to be shared seamlessly between 64-bit hosts and 32-bit accelerators and comes with a fully open-source on-chip network, a unified heterogeneous programming interface, and a mixed-data-model, mixed-ISA heterogeneous compiler based on LLVM. We evaluate HEROv2 in four case studies ranging from the application level through the toolchain and system architecture down to the accelerator microarchitecture. We demonstrate how HEROv2 enables effective research and development on the full stack of heterogeneous computing. For instance, the compiler can tile loops and infer data transfers to and from the accelerators, which leads to a speedup of up to 4.4x compared to the original program and in most cases is only 15% slower than a handwritten implementation, which requires 2.6x more code.
