Paper Title
COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training
Paper Authors
Paper Abstract
Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization--to amortize their steep cost--is a challenging task requiring careful balance of compute, memory, and network resources. Moreover, a plethora of each model's tuning knobs drastically affect the performance, with optimal values often depending on the underlying cluster's characteristics, which necessitates a complex cluster-workload co-design process. To facilitate the design space exploration of such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training. We develop a step-by-step process to establish a reusable and flexible methodology, and demonstrate its application with case studies of training large models on cluster configurations of variable compute, memory, and network resources. Our case studies demonstrate COMET's utility in identifying promising architectural optimization directions and guiding system designers in configuring key model and cluster parameters. To illustrate, cluster configuration comparisons identify performance differences of up to 7.7x and highlight performance optimization opportunities of up to 1.4x when employing memory expansion as an optimization technique.
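To make the cluster-workload co-design idea concrete, below is a minimal, hypothetical sketch of the kind of design-space sweep such a methodology might perform: a simple analytic step-time model is evaluated over candidate (data, tensor) parallel splits and two cluster variants that differ only in accelerator memory. All class names, constants, and the cost model itself are illustrative assumptions for this sketch, not COMET's actual workflow or numbers.

```python
# Hypothetical co-design sweep: pair parallelization strategies with cluster
# configurations and rank them by a crude analytic per-step time estimate.
from dataclasses import dataclass
from itertools import product


@dataclass
class Cluster:
    name: str
    gpus: int
    flops_per_gpu: float   # sustained FLOP/s per accelerator (assumed)
    mem_per_gpu: float     # bytes of accelerator memory (assumed)
    link_bw: float         # bytes/s of inter-node bandwidth (assumed)


@dataclass
class Model:
    params: float           # parameter count
    flops_per_token: float  # training FLOPs per token (assumed ~6 * params)
    tokens_per_step: float  # global batch size in tokens


def step_time(model: Model, cluster: Cluster, dp: int, tp: int):
    """Roughly estimate one training step's time for a (data, tensor) parallel split."""
    if dp * tp != cluster.gpus:
        return None
    # Memory check: weights + optimizer state sharded across tensor-parallel ranks.
    bytes_per_param = 16  # fp16 weights plus fp32 optimizer state (assumed)
    if model.params * bytes_per_param / tp > cluster.mem_per_gpu:
        return None  # configuration does not fit in accelerator memory
    compute = (model.flops_per_token * model.tokens_per_step
               / (cluster.gpus * cluster.flops_per_gpu))
    # Ring all-reduce of gradients across data-parallel ranks (simplified cost model).
    grad_bytes = 2 * model.params / tp
    allreduce = 2 * (dp - 1) / dp * grad_bytes / cluster.link_bw
    return compute + allreduce


if __name__ == "__main__":
    model = Model(params=70e9, flops_per_token=6 * 70e9, tokens_per_step=4e6)
    clusters = [
        Cluster("baseline", 1024, 300e12, 80e9, 100e9),
        Cluster("memory-expanded", 1024, 300e12, 160e9, 100e9),
    ]
    for cluster in clusters:
        candidates = [
            (t, dp, tp)
            for dp, tp in product([64, 128, 256, 512, 1024], [1, 2, 4, 8, 16])
            if (t := step_time(model, cluster, dp, tp)) is not None
        ]
        print(cluster.name, min(candidates, default=None))
```

Under these assumed numbers, the memory-expanded cluster admits parallelization strategies that do not fit on the baseline, which is the flavor of optimization opportunity (e.g., the 1.4x memory-expansion result) the abstract refers to; a real study would replace this analytic model with detailed simulation or measurement.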