Paper Title
AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning
Paper Authors
Paper Abstract
Fine-tuning large pre-trained language models on downstream tasks is apt to suffer from overfitting when limited training data is available. While dropout proves to be an effective antidote by randomly dropping a proportion of units, existing research has not examined its effect on the self-attention mechanism. In this paper, we investigate this problem through self-attention attribution and find that dropping attention positions with low attribution scores can accelerate training and increase the risk of overfitting. Motivated by this observation, we propose Attribution-Driven Dropout (AD-DROP), which randomly discards some high-attribution positions to encourage the model to make predictions by relying more on low-attribution positions to reduce overfitting. We also develop a cross-tuning strategy to alternate fine-tuning and AD-DROP to avoid dropping high-attribution positions excessively. Extensive experiments on various benchmarks show that AD-DROP yields consistent improvements over baselines. Analysis further confirms that AD-DROP serves as a strategic regularizer to prevent overfitting during fine-tuning.
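To make the masking step the abstract describes more concrete, here is a minimal PyTorch sketch. The attribution score (attention weights multiplied element-wise by their gradients), the function name `ad_drop_mask`, and the `candidate_ratio`/`drop_ratio` parameters are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch

def ad_drop_mask(attn, attn_grad, candidate_ratio=0.3, drop_ratio=0.5):
    """Sketch: build a mask that randomly drops high-attribution attention positions.

    attn:            [batch, heads, seq, seq] attention probabilities
    attn_grad:       gradient of the task loss w.r.t. `attn` (from a backward pass)
    candidate_ratio: fraction of positions treated as high-attribution candidates (assumed)
    drop_ratio:      fraction of those candidates actually dropped (assumed)
    """
    # Approximate self-attention attribution as attention times its gradient.
    attribution = attn * attn_grad

    b, h, q, k = attribution.shape
    flat = attribution.view(b, h, q * k)
    n_candidates = max(1, int(candidate_ratio * q * k))

    # Indices of the highest-attribution positions per head.
    _, top_idx = flat.topk(n_candidates, dim=-1)

    # Randomly keep (1 - drop_ratio) of the candidates and drop the rest,
    # so the model must rely more on low-attribution positions.
    keep = (torch.rand(b, h, n_candidates, device=attn.device) > drop_ratio).float()
    mask = torch.ones_like(flat)
    mask.scatter_(-1, top_idx, keep)

    # Multiply into the attention map (or translate to -inf before the softmax).
    return mask.view(b, h, q, k)
```

Under the cross-tuning strategy sketched in the abstract, such a mask would be applied only on alternating epochs, with plain fine-tuning in between, so that high-attribution positions are not dropped excessively.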