CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM

Paper title

CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM

Authors

Twinkle Jain, Gene Cooperman

Abstract

The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues to grow. While fault tolerance is a critical issue for supercomputing, there does not currently exist an efficient, scalable solution for CUDA applications on NVIDIA GPUs. CRAC (Checkpoint-Restart Architecture for CUDA) is a new checkpoint-restart solution for fault tolerance that supports the full range of CUDA applications. CRAC combines: low runtime overhead (approximately 1% or less); fast checkpoint-restart; support for scalable CUDA streams (for efficient usage of all of the thousands of GPU cores); and support for the full features of Unified Virtual Memory (eliminating the programmer's burden of migrating memory between device and host). CRAC achieves its flexible architecture by segregating application code (checkpointed) and its external GPU communication via non-reentrant CUDA libraries (not checkpointed) within a single process's memory. This eliminates the high overhead of inter-process communication in earlier approaches, and has fewer limitations.
