Paper Title
Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models
Paper Authors
Paper Abstract
Language modeling on large-scale datasets leads to impressive performance gains on various downstream language tasks. The validation pre-training loss (or perplexity in autoregressive language modeling) is often used as the evaluation metric when developing language models, since the pre-training loss tends to be well-correlated with downstream performance (which is itself difficult to evaluate comprehensively). Contrary to this conventional wisdom, this paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) the flatness of the model is well-correlated with downstream performance where pre-training loss is not. On simplified datasets, we identify three ways to produce models with the same (statistically optimal) pre-training loss but different downstream performance: continuing pre-training after convergence, increasing the model size, and changing the training algorithm. These experiments demonstrate the existence of an implicit bias of pre-training algorithms/optimizers: among models with the same minimal pre-training loss, they implicitly prefer more transferable ones. Toward understanding this implicit bias, we prove that SGD with standard mini-batch noise implicitly prefers flatter minima in language models, and empirically observe a strong correlation between flatness and downstream performance among models with the same minimal pre-training loss. We also prove in a synthetic language setting that among the models with the minimal pre-training loss, the flattest model transfers to downstream tasks.
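The abstract's key quantity beyond pre-training loss is the flatness of the loss minimum. As a minimal sketch, assuming flatness is proxied by the trace of the loss Hessian (a common choice; the exact measure used in the paper is not specified here), the snippet below estimates tr(H) with Hutchinson's estimator via Hessian-vector products. The tiny next-token model, random data, and function names are illustrative assumptions, not the paper's implementation.

```python
# Sketch: estimate the trace of the loss Hessian (a flatness proxy) with
# Hutchinson's estimator, E_v[v^T H v] for Rademacher v, using HVPs.
# The toy model and random data below are placeholders for illustration.
import torch
import torch.nn as nn


def hessian_trace(loss_fn, params, n_samples=10):
    """Hutchinson estimate of tr(H) for the loss at the current parameters."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimate = 0.0
    for _ in range(n_samples):
        # Rademacher probe vectors (+1/-1 with equal probability).
        vs = [torch.randint_like(p, 0, 2) * 2.0 - 1.0 for p in params]
        # Hessian-vector product: differentiate (grad . v) w.r.t. parameters.
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params, retain_graph=True)
        estimate += sum((h * v).sum().item() for h, v in zip(hvs, vs))
    return estimate / n_samples


# Toy usage: flatness of a tiny next-token model's cross-entropy loss.
vocab, dim = 50, 16
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
tokens = torch.randint(0, vocab, (32,))
targets = torch.randint(0, vocab, (32,))
loss_fn = lambda: nn.functional.cross_entropy(model(tokens), targets)
params = [p for p in model.parameters() if p.requires_grad]
print("estimated Hessian trace:", hessian_trace(loss_fn, params))
```

In this setting, two models with the same pre-training loss can be compared by such a flatness estimate, with the flatter one expected to transfer better according to the paper's claim.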