Paper Title
A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models
Paper Authors
Paper Abstract
Despite the remarkable success of pre-trained language models (PLMs), they still face two challenges: First, large-scale PLMs are inefficient in terms of memory footprint and computation. Second, on downstream tasks, PLMs tend to rely on dataset bias and struggle to generalize to out-of-distribution (OOD) data. In response to the efficiency problem, recent studies show that dense PLMs can be replaced with sparse subnetworks without hurting performance. Such subnetworks can be found in three scenarios: 1) in fine-tuned PLMs; 2) in raw PLMs, which are then fine-tuned in isolation; and even 3) in PLMs without any parameter fine-tuning. However, these results are only obtained in the in-distribution (ID) setting. In this paper, we extend the study of PLM subnetworks to the OOD setting, investigating whether sparsity and robustness to dataset bias can be achieved simultaneously. To this end, we conduct extensive experiments with the pre-trained BERT model on three natural language understanding (NLU) tasks. Our results demonstrate that \textbf{sparse and robust subnetworks (SRNets) can consistently be found in BERT}, across the aforementioned three scenarios, using different training and compression methods. Furthermore, we explore the upper bound of SRNets using the OOD information and show that \textbf{there exist sparse and almost unbiased BERT subnetworks}. Finally, we present 1) an analytical study that provides insights into how to promote the efficiency of the SRNet searching process and 2) a solution to improve subnetworks' performance at high sparsity. The code is available at https://github.com/llyx97/sparse-and-robust-PLM.
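To make the notion of a "sparse subnetwork" concrete, below is a minimal illustrative sketch, not the paper's exact pipeline, of extracting a subnetwork from BERT by global magnitude pruning of the encoder's linear weights. The model name, task label count, and sparsity level are assumptions chosen for illustration; see the repository linked above for the authors' actual training and compression methods.

```python
# Illustrative sketch only: obtain a sparse BERT subnetwork via global
# magnitude pruning of the encoder's linear weights (embeddings and the
# classifier head are left dense). Assumes `torch` and `transformers`.
import torch
from transformers import AutoModelForSequenceClassification

# Hypothetical setup: an MNLI-style NLU task with 3 labels.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

sparsity = 0.7  # prune 70% of encoder weights (illustrative value)

# Collect all linear weight matrices in the BERT encoder.
weights = [
    m.weight
    for m in model.bert.encoder.modules()
    if isinstance(m, torch.nn.Linear)
]

# Global magnitude threshold: the smallest |w| across all encoder layers
# are pruned until the target sparsity is reached.
all_scores = torch.cat([w.detach().abs().flatten() for w in weights])
k = int(sparsity * all_scores.numel())
threshold = torch.kthvalue(all_scores, k).values

# Apply binary masks in place; the surviving weights define the subnetwork,
# which can then be fine-tuned or evaluated on ID and OOD data.
with torch.no_grad():
    for w in weights:
        w.mul_((w.abs() > threshold).float())
```

Global (rather than per-layer) thresholding is used here only to keep the sketch short; whether pruning happens before, during, or after fine-tuning corresponds to the three scenarios discussed in the abstract.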