Paper Title

How fine can fine-tuning be? Learning efficient language models

Paper Authors

Evani Radiya-Dixit, Xin Wang

Paper Abstract

State-of-the-art performance on language understanding tasks is now achieved with increasingly large networks; the current record holder has billions of parameters. Given a language model pre-trained on massive unlabeled text corpora, only very light supervised fine-tuning is needed to learn a task: the number of fine-tuning steps is typically five orders of magnitude lower than the total parameter count. Does this mean that fine-tuning only introduces small differences from the pre-trained model in the parameter space? If so, can one avoid storing and computing an entire model for each task? In this work, we address these questions by using Bidirectional Encoder Representations from Transformers (BERT) as an example. As expected, we find that the fine-tuned models are close in parameter space to the pre-trained one, with the closeness varying from layer to layer. We show that it suffices to fine-tune only the most critical layers. Further, we find that there are surprisingly many good solutions in the set of sparsified versions of the pre-trained model. As a result, fine-tuning of huge language models can be achieved by simply setting a certain number of entries in certain layers of the pre-trained parameters to zero, saving both task-specific parameter storage and computational cost.
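
To make the abstract's two main ideas concrete, here is a minimal sketch (not the authors' code): it measures per-parameter closeness between a pre-trained and a fine-tuned BERT, and zeroes out a fraction of entries in one pre-trained layer as a stand-in for "fine-tuning by sparsification". It assumes PyTorch and the Hugging Face transformers library; the checkpoint names, the layer chosen, and the 10% sparsity level are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch (not the authors' code) of the two ideas in the abstract:
# (1) measuring how close a fine-tuned BERT is to the pre-trained one, parameter by parameter,
# (2) "fine-tuning by sparsification": zeroing a fraction of entries in a pre-trained layer.
# Assumes PyTorch and the Hugging Face `transformers` library; the model names and the
# 10% sparsity level are illustrative, not taken from the paper.
import torch
from transformers import BertModel

pretrained = BertModel.from_pretrained("bert-base-uncased")
finetuned = BertModel.from_pretrained("bert-base-uncased")  # stand-in; load a task-tuned checkpoint here

with torch.no_grad():
    # (1) Per-parameter closeness in parameter space (relative L2 distance).
    pre_params = dict(pretrained.named_parameters())
    for name, p_ft in finetuned.named_parameters():
        p_pre = pre_params[name]
        rel_dist = (p_ft - p_pre).norm() / p_pre.norm()
        print(f"{name}: relative L2 distance = {rel_dist.item():.4f}")

    # (2) Sparsify one layer of the *pre-trained* weights by zeroing the
    # smallest-magnitude 10% of entries; only the binary mask is task-specific.
    w = pretrained.encoder.layer[0].attention.self.query.weight
    k = int(0.10 * w.numel())
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).float()   # 1 = keep, 0 = zero out
    w.mul_(mask)                           # task-specific model = pre-trained weights * mask
```

In such a scheme, only the binary mask (and which layers it applies to) would need to be stored per task, rather than a full task-specific copy of the model parameters, which is the storage and compute saving the abstract refers to.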
