Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning

Paper Title

Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning

Authors

Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, Aditya Akella

Abstract

Dynamic adaptation has become an essential technique in accelerating distributed machine learning (ML) training. Recent studies have shown that dynamically adjusting model structure (e.g., lottery ticket hypothesis) or hyperparameters (e.g., batch size) can significantly accelerate training without sacrificing accuracy. However, existing ML cluster schedulers are not designed to handle dynamic adaptation. We show that existing schemes fail to provide fairness and degrade system efficiency when the training throughput changes over time under dynamic adaptation. We design Shockwave, a scheduler with future planning that builds on two key ideas. First, Shockwave extends classic market theory from static settings to dynamic settings to co-optimize efficiency and fairness. Second, Shockwave utilizes stochastic dynamic programming to handle dynamic changes. We build a system for Shockwave and validate its performance with both trace-driven simulation and cluster experiments. Results show that for traces of ML jobs with dynamic adaptation, Shockwave improves makespan by 1.3X and fairness by 2X when compared with existing fair scheduling schemes.
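
To make the abstract's point about time-varying throughput concrete, the following minimal Python sketch compares a run-time estimate that assumes constant throughput with one that accounts for a throughput change partway through training. The job sizes, throughput values, and function names are hypothetical illustrations; they are not taken from the paper or from Shockwave's implementation.

```python
from typing import List, Tuple

# A phase is (samples_to_process, throughput_in_samples_per_second).
Phase = Tuple[float, float]


def static_estimate(phases: List[Phase]) -> float:
    """Estimate run time assuming the first phase's throughput never changes."""
    total_samples = sum(samples for samples, _ in phases)
    initial_throughput = phases[0][1]
    return total_samples / initial_throughput


def dynamic_estimate(phases: List[Phase]) -> float:
    """Estimate run time when each phase runs at its own throughput."""
    return sum(samples / throughput for samples, throughput in phases)


# Hypothetical job: a batch-size change doubles throughput halfway through.
job = [(1000.0, 10.0), (1000.0, 20.0)]

static_seconds = static_estimate(job)    # 2000 / 10
dynamic_seconds = dynamic_estimate(job)  # 1000 / 10 + 1000 / 20
print(static_seconds, dynamic_seconds)
```

In this toy case the static estimate is 200 seconds while the phase-aware estimate is 150 seconds, so a scheduler planning with the static figure would overestimate the job's run time by a third. This is only meant to show why throughput that changes under dynamic adaptation can mislead a scheduler that assumes it is fixed.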
