Paper Title
MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services
Paper Authors
Paper Abstract
As modern internet services, such as chatbots, search engines, and online advertising, demand the use of large-scale deep neural networks (DNNs), distributed training and inference over heterogeneous computing systems are desired to facilitate these DNN models. Mixture-of-Experts (MoE) is one of the most common strategies to lower the cost of training, subject to the overall size of models/data, through gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts in carrying out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference could be further improved from several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present a novel MoESys that boosts efficiency in both large-scale training and inference. Specifically, in the training procedure, the proposed MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage, so as to enjoy efficient parallelisms. For scalable inference in a single node, especially when the model size is larger than the GPU memory, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. We carried out extensive experiments to evaluate MoESys, where MoESys successfully trained a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. The comparison against the state-of-the-art shows that MoESys outperformed DeepSpeed with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. In particular, under unbalanced MoE tasks, e.g., UFO, MoESys achieved 64% higher throughput with an 18% lower memory footprint.
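
For readers unfamiliar with the inference design mentioned in the abstract, the following is a minimal, illustrative sketch of the "ring of sections" idea: parameters that do not fit in GPU memory are split into sections kept in host memory, a small GPU-resident window of the ring stays staged, and computation walks the ring in round-robin order while freed slots are refilled. All names here (Section, load_to_device, compute_on_device, ring_inference) are hypothetical placeholders rather than MoESys APIs, and NumPy copies stand in for real device transfers and kernels.

from collections import deque
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Section:
    """One slice of the model's parameters kept in host (CPU) memory."""
    name: str
    weights: np.ndarray  # stays on the CPU until staged


def load_to_device(section: Section) -> np.ndarray:
    """Stand-in for a host-to-GPU copy of one section's weights."""
    return section.weights.copy()


def compute_on_device(activations: np.ndarray, staged_weights: np.ndarray) -> np.ndarray:
    """Stand-in for running one section's layers on the GPU."""
    return np.tanh(activations @ staged_weights)


def ring_inference(sections: List[Section], x: np.ndarray, gpu_slots: int = 2) -> np.ndarray:
    """Walk the ring of sections, keeping at most `gpu_slots` sections staged at once."""
    window: deque = deque(maxlen=gpu_slots)  # the GPU-resident part of the ring
    # Pre-fill the window so computation never waits on an empty slot.
    for section in sections[:gpu_slots]:
        window.append(load_to_device(section))

    out = x
    for i in range(len(sections)):
        staged = window.popleft()            # oldest staged section is due for compute
        out = compute_on_device(out, staged)
        nxt_idx = i + gpu_slots
        if nxt_idx < len(sections):          # refill the freed slot in round-robin order
            window.append(load_to_device(sections[nxt_idx]))
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 64
    model = [Section(f"sec{i}", rng.standard_normal((dim, dim)) / dim) for i in range(6)]
    tokens = rng.standard_normal((4, dim))
    print(ring_inference(model, tokens).shape)  # (4, 64)

In a real system, keeping at least two slots staged allows the copy of the next section to overlap with computation on the current one; the sketch above only preserves the scheduling order, not the asynchronous transfers.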