HALCONE: A Hardware-Level Timestamp-based Cache Coherence Scheme for Multi-GPU systems

Paper title

HALCONE: A Hardware-Level Timestamp-based Cache Coherence Scheme for Multi-GPU systems

Authors

Saiful A. Mojumder, Yifan Sun, Leila Delshadtehrani, Yenai Ma, Trinayan Baruah, José L. Abellán, John Kim, David Kaeli, Ajay Joshi

Abstract

While multi-GPU (MGPU) systems are extremely popular for compute-intensive workloads, several inefficiencies in the memory hierarchy and data movement result in a waste of GPU resources and difficulties in programming MGPU systems. First, due to the lack of hardware-level coherence, the MGPU programming model requires the programmer to replicate and repeatedly transfer data between the GPUs' memory. This leads to inefficient use of precious GPU memory. Second, to maintain coherency across an MGPU system, transferring data using low-bandwidth and high-latency off-chip links leads to degradation in system performance. Third, since the programmer needs to manually maintain data coherence, the programming of an MGPU system to maximize its throughput is extremely challenging. To address the above issues, we propose a novel lightweight timestamp-based coherence protocol, HALCONE, for MGPU systems and modify the memory hierarchy of the GPUs to support physically shared memory. HALCONE replaces the Compute Unit (CU) level logical time counters with cache level logical time counters to reduce coherence traffic. Furthermore, HALCONE introduces a novel timestamp storage unit (TSU) with no additional performance overhead in the main memory to perform coherence actions. Our proposed HALCONE protocol maintains the data coherence in the memory hierarchy of the MGPU with minimal performance overhead (less than 1\%). Using a set of standard MGPU benchmarks, we observe that a 4-GPU MGPU system with shared memory and HALCONE performs, on average, 4.6$\times$ and 3$\times$ better than a 4-GPU MGPU system with existing RDMA and with the recently proposed HMG coherence protocol, respectively. We demonstrate the scalability of HALCONE using different GPU counts (2, 4, 8, and 16) and different CU counts (32, 48, and 64 CUs per GPU) for 11 standard benchmarks.
