Paper Title
When BERT Plays the Lottery, All Tickets Are Winning
Paper Authors
Paper Abstract
Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the "good" subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns.
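As a rough illustration of the magnitude-pruning side of this setup (the structured variant instead masks whole self-attention heads and MLP layers), the sketch below zeroes out the lowest-magnitude weights of a single linear layer and records a binary mask, which is how lottery-ticket experiments typically define a subnetwork. The helper `magnitude_prune`, the toy 768x768 layer, and the 50% sparsity level are illustrative assumptions, not the paper's actual code or configuration.

```python
import torch


def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a binary mask keeping the largest-magnitude entries of `weight`.

    `sparsity` is the fraction of weights to zero out (e.g. 0.5 drops half).
    """
    k = int(weight.numel() * sparsity)            # number of weights to drop
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()     # 1 = keep, 0 = prune


# Toy example: prune one linear layer to ~50% sparsity.
layer = torch.nn.Linear(768, 768)                 # 768 = BERT-base hidden size, purely for flavor
mask = magnitude_prune(layer.weight.data, sparsity=0.5)
layer.weight.data *= mask                         # the surviving weights form the "subnetwork"
print(f"kept {int(mask.sum())} of {mask.numel()} weights")
```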