Paper Title
The Lottery Ticket Hypothesis for Pre-trained BERT Networks
Paper Authors
Paper Abstract
In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training on a range of downstream tasks, and similar trends are emerging in other areas of deep learning. In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy and transferring to other tasks. In this work, we combine these observations to assess whether such trainable, transferrable subnetworks exist in pre-trained BERT models. For a range of downstream tasks, we indeed find matching subnetworks at 40% to 90% sparsity. We find these subnetworks at (pre-trained) initialization, a deviation from prior NLP research where they emerge only after some amount of training. Subnetworks found on the masked language modeling task (the same task used to pre-train the model) transfer universally; those found on other tasks transfer in a limited fashion if at all. As large-scale pre-training becomes an increasingly central paradigm in deep learning, our results demonstrate that the main lottery ticket observations remain relevant in this context. Code is available at https://github.com/VITA-Group/BERT-Tickets.
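To make the procedure behind these claims concrete, the sketch below prunes a pre-trained BERT by global weight magnitude so that the surviving weights keep their pre-trained values, yielding a sparse subnetwork that could then be fine-tuned on a downstream task. This is a minimal, hypothetical illustration using the HuggingFace `transformers` library and PyTorch's pruning utilities, not the authors' released code (see the repository linked above): one-shot global magnitude pruning stands in here for the iterative magnitude pruning typically used in lottery ticket studies, and the 40% sparsity level is just one point in the 40%-90% range reported in the abstract.

```python
# Minimal sketch (assumptions noted above): magnitude-prune a pre-trained BERT
# encoder to a target sparsity while keeping surviving weights at their
# pre-trained values, so the subnetwork starts from (pre-trained) initialization.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Collect the weight matrices of the encoder's Linear layers
# (embeddings and the MLM head are left dense in this sketch).
to_prune = [
    (module, "weight")
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear) and "encoder" in name
]

# Remove the smallest-magnitude 40% of the collected weights globally.
prune.global_unstructured(
    to_prune, pruning_method=prune.L1Unstructured, amount=0.4
)

# Make the pruning permanent: surviving weights keep their pre-trained values,
# pruned weights become exact zeros. The resulting sparse model would then be
# fine-tuned on a downstream task to test whether it "matches" full-model accuracy.
for module, name in to_prune:
    prune.remove(module, name)

zeros = sum((p == 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"Overall fraction of zero weights: {zeros / total:.2%}")
```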