Paper Title
On Optimizing the Communication of Model Parallelism
Paper Authors
Paper Abstract
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism - intra-operator and inter-operator parallelism - are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh, on which the tensor may be distributed with the same or different layouts. We formalize this as a many-to-many multicast communication problem, and show that existing approaches either are sub-optimal or do not generalize to different network topologies or tensor layouts, which result from different model architectures and parallelism strategies. We then propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule. On microbenchmarks, our overall system outperforms existing ones by up to 10x across various tensor and mesh layouts. On end-to-end training of two large models, GPT-3 and U-Transformer, we improve throughput by 10% and 50%, respectively.
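To make the cross-mesh resharding pattern concrete, below is a minimal, hypothetical sketch (not the paper's system) that assumes a 1-D tensor sharded evenly across the devices of a source mesh and a destination mesh; the helper names (`even_shard_ranges`, `cross_mesh_resharding_plan`) are illustrative only. It computes the many-to-many transfer plan: which element range each source device must send to each destination device when the two layouts differ.

```python
# Illustrative sketch of cross-mesh resharding as a many-to-many transfer plan.
# Assumption: a 1-D tensor of length `n`, evenly sharded across `src_devices`
# on the source mesh and `dst_devices` on the destination mesh.

def even_shard_ranges(n, num_devices):
    """Half-open index ranges [start, stop) owned by each device under even sharding."""
    base, rem = divmod(n, num_devices)
    ranges, start = [], 0
    for d in range(num_devices):
        size = base + (1 if d < rem else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def cross_mesh_resharding_plan(n, src_devices, dst_devices):
    """Return a list of (src_device, dst_device, (start, stop)) transfers."""
    src_ranges = even_shard_ranges(n, src_devices)
    dst_ranges = even_shard_ranges(n, dst_devices)
    plan = []
    for s, (s0, s1) in enumerate(src_ranges):
        for d, (d0, d1) in enumerate(dst_ranges):
            lo, hi = max(s0, d0), min(s1, d1)
            if lo < hi:  # shards overlap, so this pair must communicate
                plan.append((s, d, (lo, hi)))
    return plan

if __name__ == "__main__":
    # A tensor of 12 elements, re-sharded from 3 source devices to 4
    # destination devices: each source talks to multiple destinations and
    # vice versa -- the many-to-many multicast pattern described above.
    for src, dst, rng in cross_mesh_resharding_plan(12, 3, 4):
        print(f"src device {src} -> dst device {dst}: elements {rng}")
```

The sketch only enumerates which device pairs must communicate; the paper's contributions concern how to schedule these transfers efficiently (broadcast-based communication and an overlapping-friendly pipeline schedule), which is not modeled here.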