Paper Title
Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup
Paper Authors
Paper Abstract
Pre-trained language models, such as BERT, have achieved significant accuracy gains in many natural language processing tasks. Despite their effectiveness, the huge number of parameters makes training a BERT model computationally very challenging. In this paper, we propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT. We decompose the whole training process into several stages. Training starts from a small model with only a few encoder layers, and we gradually increase the depth of the model by adding new encoder layers. At each stage, we only train the top few (near the output layer) newly added encoder layers; the parameters of the layers trained in previous stages are not updated in the current stage. In BERT training, the backward computation is much more time-consuming than the forward computation, especially in the distributed training setting, where the backward computation time further includes the communication time for gradient synchronization. Under the proposed training strategy, only the top few layers participate in backward computation, while most layers only participate in forward computation, so both computation and communication efficiency are greatly improved. Experimental results show that the proposed method achieves more than 110% training speedup without significant performance degradation.
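The stack-and-freeze recipe described in the abstract can be illustrated with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses nn.TransformerEncoderLayer as a stand-in for BERT encoder layers, and the stage schedule, the GrowingEncoder class, its add_layers helper, and the dummy data are all hypothetical.

```python
import torch
import torch.nn as nn

class GrowingEncoder(nn.Module):
    """Stand-in for a BERT-style encoder whose depth grows stage by stage."""

    def __init__(self, hidden=768, heads=12, vocab=30522):
        super().__init__()
        self.layers = nn.ModuleList()          # encoder layers, bottom to top
        self.head = nn.Linear(hidden, vocab)   # output (MLM) head, trained in every stage
        self.hidden, self.heads = hidden, heads

    def add_layers(self, n):
        """Freeze all previously trained layers, then stack n new layers on top."""
        for p in self.layers.parameters():
            p.requires_grad = False
        for _ in range(n):
            self.layers.append(
                nn.TransformerEncoderLayer(self.hidden, self.heads, batch_first=True)
            )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return self.head(x)


model = GrowingEncoder()
stage_schedule = [3, 3, 3, 3]   # hypothetical schedule: 4 stages x 3 layers -> 12 layers

for n_new in stage_schedule:
    model.add_layers(n_new)
    # Only parameters that still require gradients (new top layers + head) go to the
    # optimizer, so backward computation and gradient updates are limited to them.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)

    for _ in range(10):                           # toy per-stage training loop
        x = torch.randn(8, 128, model.hidden)     # dummy pre-embedded batch
        labels = torch.randint(0, 30522, (8, 128))
        logits = model(x)                         # frozen layers run forward only
        loss = nn.functional.cross_entropy(logits.transpose(1, 2), labels)
        loss.backward()                           # backward touches only the top layers
        optimizer.step()
        optimizer.zero_grad()
```

Because the frozen layers sit below the newly added ones and their parameters have requires_grad set to False, autograd never extends the graph through them, so the backward pass stops at the top of the frozen stack. In a distributed run, wrapping the model with torch.nn.parallel.DistributedDataParallel after each freezing step would likewise exclude the frozen parameters from gradient all-reduce, which is roughly where the communication savings described in the abstract would come from.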