Paper Title

Parameter-Efficient Transfer Learning with Diff Pruning

Paper Authors

Demi Guo, Alexander M. Rush, Yoon Kim

Paper Abstract

While task-specific finetuning of pretrained networks has led to significant empirical advances in NLP, the large size of networks makes finetuning difficult to deploy in multi-task, memory-constrained settings. We propose diff pruning as a simple approach to enable parameter-efficient transfer learning within the pretrain-finetune framework. This approach views finetuning as learning a task-specific diff vector that is applied on top of the pretrained parameter vector, which remains fixed and is shared across different tasks. The diff vector is adaptively pruned during training with a differentiable approximation to the L0-norm penalty to encourage sparsity. Diff pruning becomes parameter-efficient as the number of tasks increases, as it requires storing only the nonzero positions and weights of the diff vector for each task, while the cost of storing the shared pretrained model remains constant. It further does not require access to all tasks during training, which makes it attractive in settings where tasks arrive in stream or the set of tasks is unknown. We find that models finetuned with diff pruning can match the performance of fully finetuned baselines on the GLUE benchmark while only modifying 0.5% of the pretrained model's parameters per task.
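The abstract describes finetuning as learning a task-specific diff vector added to frozen pretrained weights, with the diff vector sparsified during training via a differentiable approximation to the L0-norm penalty. Below is a minimal sketch of that idea, assuming a PyTorch setting and the stretched hard-concrete gates commonly used as a differentiable L0 surrogate (Louizos et al., 2018). The class name `DiffPrunedParam`, the constants `BETA`/`GAMMA`/`ZETA`, and the zero initializations are illustrative assumptions, not the authors' released implementation.

```python
import math
import torch
import torch.nn as nn

# Hard-concrete distribution hyperparameters (Louizos et al., 2018);
# these particular values are illustrative, not taken from the paper.
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1


class DiffPrunedParam(nn.Module):
    """Task-specific diff vector delta = z * w added to a frozen pretrained tensor.

    A minimal sketch: z is a gate sampled from a stretched hard-concrete
    distribution, so an L0-style sparsity penalty on the diff stays differentiable.
    """

    def __init__(self, pretrained: torch.Tensor):
        super().__init__()
        # Frozen pretrained weights, shared across tasks (stored as a buffer).
        self.register_buffer("pretrained", pretrained.detach().clone())
        # Dense diff magnitudes and gate logits (log alpha), both trainable per task.
        self.w = nn.Parameter(torch.zeros_like(pretrained))
        self.log_alpha = nn.Parameter(torch.zeros_like(pretrained))

    def _gate(self) -> torch.Tensor:
        if self.training:
            # Reparameterized sample from the concrete distribution.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + self.log_alpha) / BETA)
        else:
            s = torch.sigmoid(self.log_alpha)
        # Stretch to (GAMMA, ZETA), then clip to [0, 1] ("hard" concrete gate).
        return (s * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)

    def forward(self) -> torch.Tensor:
        # Effective task parameters: pretrained + gated diff.
        return self.pretrained + self._gate() * self.w

    def l0_penalty(self) -> torch.Tensor:
        # Expected number of nonzero gates: a differentiable surrogate for ||delta||_0.
        return torch.sigmoid(self.log_alpha - BETA * math.log(-GAMMA / ZETA)).sum()
```

In this sketch, one would wrap each pretrained weight tensor per task, add a weighted `l0_penalty()` term to the task loss, and after training store only the nonzero entries of the gated diff, which is what makes the per-task storage cost small.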
