Paper Title
AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning
Paper Authors
Paper Abstract
Fine-tuning large pre-trained language models on downstream tasks is apt to suffer from overfitting when limited training data is available. While dropout proves to be an effective antidote by randomly dropping a proportion of units, existing research has not examined its effect on the self-attention mechanism. In this paper, we investigate this problem through self-attention attribution and find that dropping attention positions with low attribution scores can accelerate training and increase the risk of overfitting. Motivated by this observation, we propose Attribution-Driven Dropout (AD-DROP), which randomly discards some high-attribution positions to encourage the model to make predictions by relying more on low-attribution positions to reduce overfitting. We also develop a cross-tuning strategy to alternate fine-tuning and AD-DROP to avoid dropping high-attribution positions excessively. Extensive experiments on various benchmarks show that AD-DROP yields consistent improvements over baselines. Analysis further confirms that AD-DROP serves as a strategic regularizer to prevent overfitting during fine-tuning.
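To make the masking step the abstract describes more concrete, here is a minimal PyTorch sketch. The attribution score (attention weights multiplied element-wise by their gradients), the function name `ad_drop_mask`, and the `candidate_ratio`/`drop_ratio` parameters are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch

def ad_drop_mask(attn, attn_grad, candidate_ratio=0.3, drop_ratio=0.5):
    """Sketch: build a mask that randomly drops high-attribution attention positions.

    attn:            [batch, heads, seq, seq] attention probabilities
    attn_grad:       gradient of the task loss w.r.t. `attn` (from a backward pass)
    candidate_ratio: fraction of positions treated as high-attribution candidates (assumed)
    drop_ratio:      fraction of those candidates actually dropped (assumed)
    """
    # Approximate self-attention attribution as attention times its gradient.
    attribution = attn * attn_grad

    b, h, q, k = attribution.shape
    flat = attribution.view(b, h, q * k)
    n_candidates = max(1, int(candidate_ratio * q * k))

    # Indices of the highest-attribution positions per head.
    _, top_idx = flat.topk(n_candidates, dim=-1)

    # Randomly keep (1 - drop_ratio) of the candidates and drop the rest,
    # so the model must rely more on low-attribution positions.
    keep = (torch.rand(b, h, n_candidates, device=attn.device) > drop_ratio).float()
    mask = torch.ones_like(flat)
    mask.scatter_(-1, top_idx, keep)

    # Multiply into the attention map (or translate to -inf before the softmax).
    return mask.view(b, h, q, k)
```

Under the cross-tuning strategy sketched in the abstract, such a mask would be applied only on alternating epochs, with plain fine-tuning in between, so that high-attribution positions are not dropped excessively.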