Paper Title

Better Fine-Tuning by Reducing Representational Collapse

Authors

Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Naman Goyal, Luke Zettlemoyer, Sonal Gupta

Abstract


Although widely adopted, existing approaches for fine-tuning pre-trained language models have been shown to be unstable across hyper-parameter settings, motivating recent work on trust region methods. In this paper, we present a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampling from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning when possible without hurting performance. We also introduce a new analysis to motivate the use of trust region methods more generally, by studying representational collapse: the degradation of generalizable representations from pre-trained models as they are fine-tuned for a specific end task. Extensive experiments show that our fine-tuning method matches or exceeds the performance of previous trust region methods on a range of understanding and generation tasks (including DailyMail/CNN, Gigaword, Reddit TIFU, and the GLUE benchmark), while also being much faster. We also show that it is less prone to representation collapse: the pre-trained models maintain more generalizable representations every time they are fine-tuned.
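The noise-based objective the abstract describes can be sketched roughly as follows: the task loss on the clean input is augmented with a symmetric-KL smoothness term between the model's predictions on clean and noise-perturbed inputs, with the noise drawn from a normal or uniform distribution. This is an illustrative NumPy sketch, not the authors' implementation; the function name `noise_regularized_loss` and the parameters `lam` and `sigma` are hypothetical stand-ins.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    # Row-wise KL divergence KL(p || q) between categorical distributions.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def noise_regularized_loss(model, x, y, lam=1.0, sigma=1e-5,
                           noise="normal", rng=None):
    """Cross-entropy on the clean input plus a symmetric-KL term that
    penalizes prediction change under small parametric input noise
    (a sketch of the abstract's noise-based trust-region objective)."""
    rng = rng or np.random.default_rng(0)
    if noise == "normal":
        eps = rng.normal(0.0, sigma, size=x.shape)
    else:  # "uniform"
        eps = rng.uniform(-sigma, sigma, size=x.shape)
    p = softmax(model(x))          # predictions on the clean input
    q = softmax(model(x + eps))    # predictions on the perturbed input
    ce = -np.log(p[np.arange(len(y)), y] + 1e-12).mean()
    sym_kl = (kl(p, q) + kl(q, p)).mean()
    return ce + lam * sym_kl
```

Because the regularizer needs only one extra forward pass on noised inputs (rather than an inner adversarial search), this style of objective is cheaper per step than adversarial trust-region alternatives, which is consistent with the speed claim in the abstract.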
