Paper Title

DynaBERT: Dynamic BERT with Adaptive Width and Depth

Paper Authors

Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu

Paper Abstract

Pre-trained language models like BERT, though powerful in many natural language processing tasks, are expensive in both computation and memory. To alleviate this problem, one approach is to compress them for specific tasks before deployment. However, recent works on BERT compression usually compress the large BERT model to a fixed smaller size. They cannot fully satisfy the requirements of different edge devices with various hardware performance. In this paper, we propose a novel dynamic BERT model (abbreviated as DynaBERT), which can flexibly adjust its size and latency by selecting adaptive width and depth. The training process of DynaBERT includes first training a width-adaptive BERT and then allowing both adaptive width and depth, by distilling knowledge from the full-sized model to small sub-networks. Network rewiring is also used to keep the more important attention heads and neurons shared by more sub-networks. Comprehensive experiments under various efficiency constraints demonstrate that our proposed dynamic BERT (or RoBERTa) at its largest size has comparable performance to BERT-base (or RoBERTa-base), while at smaller widths and depths it consistently outperforms existing BERT compression methods. Code is available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT.
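
To make the abstract's idea of width-adaptive sub-networks plus distillation more concrete, below is a minimal PyTorch sketch, not the authors' implementation. The names `SlimmableFFN`, `distillation_loss`, `width_mult`, `hidden_size`, and `intermediate_size` are illustrative assumptions; it only shows how one set of parameters can serve sub-networks of different widths (assuming neurons are already sorted by importance, in the spirit of the paper's network rewiring) and how a sub-network can be trained against the full-sized model's logits.

```python
# Illustrative sketch only; not the DynaBERT implementation from the linked repo.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableFFN(nn.Module):
    """A feed-forward block whose hidden neurons can be sliced by a width multiplier.
    Neurons are assumed to be pre-sorted by importance, so a smaller width
    keeps the most important ones (analogous to network rewiring)."""
    def __init__(self, hidden_size=768, intermediate_size=3072):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, intermediate_size)
        self.fc2 = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x, width_mult=1.0):
        # Keep only the first n intermediate neurons for this sub-network.
        n = int(self.fc1.out_features * width_mult)
        h = F.gelu(F.linear(x, self.fc1.weight[:n], self.fc1.bias[:n]))
        return F.linear(h, self.fc2.weight[:, :n], self.fc2.bias)

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft-label distillation from the full-sized model (teacher) to a sub-network (student)."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t ** 2)

# The same parameters serve sub-networks of several widths; depth adaptivity
# would analogously drop whole Transformer layers at inference time.
layer = SlimmableFFN()
x = torch.randn(2, 16, 768)
for width_mult in (1.0, 0.75, 0.5, 0.25):
    print(width_mult, layer(x, width_mult=width_mult).shape)
```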
