Paper Title

A Kernel-Based View of Language Model Fine-Tuning

Paper Authors

Sadhika Malladi, Alexander Wettig, Dingli Yu, Danqi Chen, Sanjeev Arora

Paper Abstract

It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK) - which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization - describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.
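As background for the kernel view referenced in the abstract, the following is a minimal sketch of the standard empirical NTK setup in generic notation; the symbols $f$, $\theta_0$, and $K$ are illustrative and not taken from the paper itself.

```latex
% Empirical NTK evaluated at the pre-trained parameters \theta_0.
% f(x; \theta) denotes the model's scalar output for input x
% (e.g., the logit of the label word under a prompt).
\[
  K(x, x') = \nabla_\theta f(x; \theta_0)^\top \, \nabla_\theta f(x'; \theta_0)
\]
% "Kernel-based dynamics" means fine-tuning stays close to the first-order
% expansion around \theta_0, so the trained model behaves like a kernel
% method with kernel K rather than a fully nonlinear learner:
\[
  f(x; \theta) \approx f(x; \theta_0)
    + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0)
\]
```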
