Paper Title
Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
Paper Authors
Paper Abstract
Transformer models have achieved state-of-the-art performance across various application domains and have gradually become the foundation of advanced large deep learning (DL) models. However, training these models efficiently over multiple GPUs remains challenging due to the large number of parallelism choices. Existing DL systems either rely on manual effort to craft distributed training plans or apply parallelism combinations within a very limited search space. In this paper, we propose Galvatron, a new system framework that incorporates multiple popular parallelism dimensions and automatically finds the most efficient hybrid parallelism strategy. To better explore such a huge search space, we 1) involve a decision tree to perform decomposition and pruning based on some reasonable intuitions, and then 2) design a dynamic programming search algorithm to generate the optimal plan. Evaluations on four representative Transformer workloads show that Galvatron can perform automatic distributed training under different GPU memory budgets. Across all evaluated scenarios, Galvatron consistently achieves superior system throughput compared to previous work with limited parallelism.
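The abstract describes a dynamic programming search that chooses a parallelism strategy per layer under a GPU memory budget. The following is a minimal sketch of that general idea under simplified assumptions; the `Strategy` candidates, cost numbers, and `dp_search` function are hypothetical placeholders, not Galvatron's actual cost model or API.

```python
# Illustrative sketch only: pick one parallelism strategy per Transformer layer
# so that the estimated execution time is minimized while the total estimated
# memory stays within a given GPU memory budget. All costs are placeholders.
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str         # e.g. "dp", "tp", "pp", or a hybrid combination
    time_cost: float  # estimated per-layer execution time (arbitrary units)
    mem_cost: int     # estimated per-layer memory footprint (MB)


def dp_search(layers: int, candidates: list[Strategy], mem_budget: int):
    """Return (best_total_time, per-layer strategy list), or (inf, None) if infeasible."""
    # best[m] = (total time, chosen strategies) using exactly m MB so far
    best: dict[int, tuple[float, list[Strategy]]] = {0: (0.0, [])}
    for _ in range(layers):
        nxt: dict[int, tuple[float, list[Strategy]]] = {}
        for used, (t, plan) in best.items():
            for s in candidates:
                m = used + s.mem_cost
                if m > mem_budget:
                    continue  # prune states that exceed the memory budget
                cand = (t + s.time_cost, plan + [s])
                if m not in nxt or cand[0] < nxt[m][0]:
                    nxt[m] = cand
        best = nxt
        if not best:
            return float("inf"), None
    return min(best.values(), key=lambda x: x[0])


if __name__ == "__main__":
    # Hypothetical per-layer costs for three already-pruned candidate strategies.
    cands = [Strategy("tp", 1.0, 800), Strategy("dp", 1.4, 500), Strategy("pp", 1.2, 600)]
    time_est, plan = dp_search(layers=4, candidates=cands, mem_budget=2600)
    print(time_est, [s.name for s in plan])
```

In this toy setting the decision-tree step from the abstract would correspond to shrinking the `candidates` list before the search runs, which keeps the dynamic program over (layer, memory-used) states tractable.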