Paper Title


Aryl: An Elastic Cluster Scheduler for Deep Learning

Authors

Jiamin Li, Hong Xu, Yibo Zhu, Zherui Liu, Chuanxiong Guo, Cong Wang

Abstract


Companies build separate training and inference GPU clusters for deep learning, and use separate schedulers to manage them. This leads to problems for both training and inference: inference clusters have low GPU utilization when the traffic load is low; training jobs often experience long queueing time due to lack of resources. We introduce Aryl, a new cluster scheduler to address these problems. Aryl introduces capacity loaning to loan idle inference GPU servers for training jobs. It further exploits elastic scaling that scales a training job's GPU allocation to better utilize loaned resources. Capacity loaning and elastic scaling create new challenges to cluster management. When the loaned servers need to be returned, we need to minimize the number of job preemptions; when more GPUs become available, we need to allocate them to elastic jobs and minimize the job completion time (JCT). Aryl addresses these combinatorial problems using principled heuristics. It introduces the notion of server preemption cost which it greedily reduces during server reclaiming. It further relies on the JCT reduction value defined for each additional worker for an elastic job to solve the scheduling problem as a multiple-choice knapsack problem. Prototype implementation on a 64-GPU testbed and large-scale simulation with 15-day traces of over 50,000 production jobs show that Aryl brings 1.53x and 1.50x reductions in average queuing time and JCT, and improves cluster usage by up to 26.9% over the cluster scheduler without capacity loaning or elastic scaling.
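The abstract frames elastic GPU allocation as a multiple-choice knapsack problem: each elastic job offers several worker-count options, each with a GPU cost and a JCT reduction value, and the scheduler picks at most one option per job to maximize total JCT reduction within the available GPU budget. The sketch below illustrates that formulation with a standard dynamic program; it is a minimal illustration under assumed inputs, not Aryl's actual implementation, and the function and parameter names are hypothetical.

```python
def allocate_gpus(jobs, budget):
    """Multiple-choice knapsack sketch (not Aryl's code).

    jobs:   list of option lists, one per elastic job; each option is a
            (gpu_cost, jct_reduction) pair for one candidate worker count.
    budget: number of idle GPUs available for elastic scaling.

    Returns (best_total_jct_reduction, picks), where picks[i] is the index
    of the chosen option for job i, or None if job i gets no extra GPUs.
    """
    n = len(jobs)
    # dp[i][g]: best total JCT reduction using jobs[0..i) and at most g GPUs.
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    choice = [[None] * (budget + 1) for _ in range(n)]
    for i, options in enumerate(jobs):
        for g in range(budget + 1):
            dp[i + 1][g] = dp[i][g]  # option: give job i no extra GPUs
            for k, (cost, gain) in enumerate(options):
                if cost <= g and dp[i][g - cost] + gain > dp[i + 1][g]:
                    dp[i + 1][g] = dp[i][g - cost] + gain
                    choice[i][g] = k
    # Backtrack to recover which option (if any) each job was assigned.
    picks, g = [None] * n, budget
    for i in range(n - 1, -1, -1):
        k = choice[i][g]
        picks[i] = k
        if k is not None:
            g -= jobs[i][k][0]
    return dp[n][budget], picks


# Example: job 0 can take 1 GPU (gain 5.0) or 2 GPUs (gain 8.0);
# job 1 can take 2 GPUs (gain 6.0); 3 GPUs are idle.
total, picks = allocate_gpus([[(1, 5.0), (2, 8.0)], [(2, 6.0)]], 3)
```

With 3 idle GPUs the best split gives job 0 one GPU and job 1 two, for a total JCT reduction of 11.0, rather than spending 2 GPUs on job 0 alone for 8.0. This greedy-looking outcome falls out of the exact DP; the paper's contribution is defining the per-worker JCT reduction values that feed such a solver.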
