Paper Title

On Optimizing the Communication of Model Parallelism

Authors

Yonghao Zhuang, Hexu Zhao, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, Hao Zhang

Abstract

We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism - intra-operator and inter-operator parallelism - are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh, on which the tensor may be distributed with the same or different layouts. We formalize this as a many-to-many multicast communication problem, and show that existing approaches either are sub-optimal or do not generalize to different network topologies or tensor layouts, which result from different model architectures and parallelism strategies. We then propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule. On microbenchmarks, our overall system outperforms existing ones by up to 10x across various tensor and mesh layouts. On end-to-end training of two large models, GPT-3 and U-Transformer, we improve throughput by 10% and 50%, respectively.
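
To make the communication pattern concrete, below is a minimal, self-contained Python sketch of cross-mesh resharding framed as a many-to-many multicast problem. It is an illustration only, not the paper's or Alpa's actual implementation: the names `Shard`, `even_layout`, and `build_multicast_plan` are hypothetical, and the example assumes a 1-D tensor sharded along a single axis.

```python
# Illustrative sketch only: cross-mesh resharding of a 1-D tensor, viewed as a
# many-to-many multicast problem. The names below (Shard, even_layout,
# build_multicast_plan) are hypothetical and are NOT the paper's or Alpa's API.

from dataclasses import dataclass

@dataclass(frozen=True)
class Shard:
    """A contiguous slice [start, stop) of the tensor held by one device."""
    device: str
    start: int
    stop: int

def even_layout(size, devices):
    """Shard a 1-D tensor of `size` elements evenly across `devices`."""
    n = len(devices)
    bounds = [size * i // n for i in range(n + 1)]
    return [Shard(d, bounds[i], bounds[i + 1]) for i, d in enumerate(devices)]

def build_multicast_plan(src_shards, dst_shards):
    """Map each needed data slice to (one sender, all receivers that need it)."""
    plan = {}  # (sender, slice) -> list of receiving devices
    for s in src_shards:
        for d in dst_shards:
            lo, hi = max(s.start, d.start), min(s.stop, d.stop)
            if lo < hi:  # source and destination shards overlap on [lo, hi)
                plan.setdefault((s.device, (lo, hi)), []).append(d.device)
    return plan

# Example: a tensor of 8 elements leaves a 2-device source mesh (pipeline
# stage i) for a 4-device destination mesh (stage i+1) on which each half of
# the tensor is replicated on two devices.
src = even_layout(8, ["src0", "src1"])
dst = [Shard("dst0", 0, 4), Shard("dst1", 0, 4),
       Shard("dst2", 4, 8), Shard("dst3", 4, 8)]
for (sender, sl), receivers in build_multicast_plan(src, dst).items():
    # A broadcast-based system sends each line below as ONE broadcast from
    # `sender` to `receivers` instead of len(receivers) point-to-point sends.
    print(f"{sender} broadcasts slice {sl} to {receivers}")
```

Grouping transfers by (sender, slice) is what makes a broadcast-based backend attractive here: whenever the destination layout replicates a slice across several devices, a single broadcast replaces multiple point-to-point sends.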
