Paper Title
On the Transformer Growth for Progressive BERT Training
Paper Authors
Paper Abstract
Due to the excessive cost of large-scale language model pre-training, considerable efforts have been made to train BERT progressively -- starting from an inferior but low-cost model and gradually growing the model to increase the computational complexity. Our objective is to advance the understanding of Transformer growth and discover principles that guide progressive training. First, we find that, similar to network architecture search, Transformer growth also favors compound scaling. Specifically, while existing methods only conduct network growth in a single dimension, we observe that it is beneficial to use compound growth operators and balance multiple dimensions (e.g., depth, width, and input length of the model). Moreover, we explore alternative growth operators in each dimension via controlled comparison to provide practical guidance for operator selection. In light of our analyses, the proposed method speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively, while achieving comparable performance.
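To make the idea of compound growth concrete, here is a minimal sketch of a staged training loop that enlarges depth, width, and input length together rather than one dimension at a time. This is not the authors' implementation; the `BertConfig` fields, the three-stage `SCHEDULE`, and the `progressive_training` helper are illustrative assumptions, and the actual growth operators studied in the paper are omitted.

```python
# Minimal sketch (assumed, not the paper's code): a compound growth
# schedule that increases depth, width, and input length jointly.
from dataclasses import dataclass


@dataclass
class BertConfig:
    num_layers: int   # depth
    hidden_size: int  # width
    max_seq_len: int  # input length


# Hypothetical three-stage schedule; the specific values are assumptions,
# not configurations reported in the paper.
SCHEDULE = [
    BertConfig(num_layers=3,  hidden_size=256, max_seq_len=128),
    BertConfig(num_layers=6,  hidden_size=512, max_seq_len=256),
    BertConfig(num_layers=12, hidden_size=768, max_seq_len=512),
]


def progressive_training(schedule, steps_per_stage):
    """Train a low-cost model first, then grow all dimensions stage by stage."""
    for stage, cfg in enumerate(schedule):
        # In the paper, a chosen growth operator would map the smaller
        # model's parameters into the larger configuration here; that
        # mapping is omitted in this sketch.
        print(f"Stage {stage}: train {steps_per_stage} steps with {cfg}")


if __name__ == "__main__":
    progressive_training(SCHEDULE, steps_per_stage=100_000)
```

The point of the sketch is only the scheduling structure: each stage balances multiple dimensions at once, which is the compound-scaling behavior the abstract argues is preferable to growing a single dimension.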