Title
Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
Authors
Abstract
Deep learning (DL) frameworks take advantage of GPUs to improve the speed of DL inference and training. Ideally, DL frameworks should be able to fully utilize the computation power of GPUs such that the running time depends on the amount of computation assigned to GPUs. Yet, we observe that in scheduling GPU tasks, existing DL frameworks suffer from inefficiencies such as large scheduling overhead and unnecessary serial execution. To this end, we propose Nimble, a DL execution engine that runs GPU tasks in parallel with minimal scheduling overhead. Nimble introduces a novel technique called ahead-of-time (AoT) scheduling. Here, the scheduling procedure finishes before executing the GPU kernel, thereby removing most of the scheduling overhead during run time. Furthermore, Nimble automatically parallelizes the execution of GPU tasks by exploiting multiple GPU streams in a single GPU. Evaluation on a variety of neural networks shows that compared to PyTorch, Nimble speeds up inference and training by up to 22.34$\times$ and 3.61$\times$, respectively. Moreover, Nimble outperforms state-of-the-art inference systems, TensorRT and TVM, by up to 2.81$\times$ and 1.70$\times$, respectively.