Paper Title

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

Paper Authors

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro

Paper Abstract

Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot, and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe are key ingredients to the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG. We demonstrate that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results. We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generation.
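
The abstract refers to 3D parallelism, i.e., combining tensor (intra-layer) parallelism, pipeline parallelism, and data parallelism across the GPU cluster via DeepSpeed and Megatron-LM. The sketch below is a minimal illustration of that decomposition, not the paper's actual configuration: the GPU counts, flag usage, and config file name are hypothetical (flag names may differ between Megatron-LM/DeepSpeed releases) and are chosen only to show how the three parallelism degrees multiply to the total number of GPUs.

```python
# Illustrative sketch of a 3D-parallel layout (hypothetical sizes, not the
# configuration used for MT-NLG). Tensor and pipeline parallelism partition
# the model itself; data parallelism replicates the resulting model shards
# across the remaining GPUs.

WORLD_SIZE = 64             # total number of GPUs (hypothetical)
TENSOR_PARALLEL_SIZE = 8    # GPUs that split each transformer layer's matrices
PIPELINE_PARALLEL_SIZE = 4  # consecutive groups of layers placed on successive stages

# The data-parallel degree is whatever remains after model parallelism:
assert WORLD_SIZE % (TENSOR_PARALLEL_SIZE * PIPELINE_PARALLEL_SIZE) == 0
DATA_PARALLEL_SIZE = WORLD_SIZE // (TENSOR_PARALLEL_SIZE * PIPELINE_PARALLEL_SIZE)
print(f"data-parallel replicas: {DATA_PARALLEL_SIZE}")  # -> 2

# Typical Megatron-LM-style launcher flags expressing this layout; the DeepSpeed
# config path below is a hypothetical placeholder.
launch_args = [
    f"--tensor-model-parallel-size={TENSOR_PARALLEL_SIZE}",
    f"--pipeline-model-parallel-size={PIPELINE_PARALLEL_SIZE}",
    "--deepspeed",                        # hand data parallelism / optimizer to DeepSpeed
    "--deepspeed_config=ds_config.json",  # hypothetical config file
]
```

In this layout, each copy of the model spans TENSOR_PARALLEL_SIZE × PIPELINE_PARALLEL_SIZE GPUs, and DeepSpeed's data-parallel engine keeps the DATA_PARALLEL_SIZE replicas synchronized during training.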
