Paper Title
On Losses for Modern Language Models
Paper Authors
Paper Abstract
BERT set many state-of-the-art results over varied NLU benchmarks by pre-training over two tasks: masked language modelling (MLM) and next sentence prediction (NSP), the latter of which has been highly criticized. In this paper, we 1) clarify NSP's effect on BERT pre-training, 2) explore fourteen possible auxiliary pre-training tasks, of which seven are novel to modern language models, and 3) investigate different ways to include multiple tasks into pre-training. We show that NSP is detrimental to training due to its context splitting and shallow semantic signal. We also identify six auxiliary pre-training tasks -- sentence ordering, adjacent sentence prediction, TF prediction, TF-IDF prediction, a FastSent variant, and a Quick Thoughts variant -- that outperform a pure MLM baseline. Finally, we demonstrate that using multiple tasks in a multi-task pre-training framework provides better results than using any single auxiliary task. Using these methods, we outperform BERT Base on the GLUE benchmark using fewer than a quarter of the training tokens.
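To make the multi-task pre-training idea concrete, below is a minimal sketch of how an MLM loss can be combined with one auxiliary objective (here, per-token TF-IDF regression) in a single training step. This is not the authors' code: the encoder, head names (`TinyEncoder`, `tfidf_head`), sizes, and the fixed `aux_weight` are illustrative assumptions, and the paper's exact task formulations and weighting scheme may differ.

```python
# Sketch: multi-task pre-training loss = MLM + weighted auxiliary TF-IDF regression.
# All module names and sizes are toy assumptions, not the paper's configuration.
import torch
import torch.nn as nn

VOCAB, HIDDEN = 30522, 128  # toy sizes, far smaller than BERT Base

class TinyEncoder(nn.Module):
    """Stand-in for a transformer encoder such as BERT, with two output heads."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(HIDDEN, VOCAB)   # predicts masked tokens
        self.tfidf_head = nn.Linear(HIDDEN, 1)     # predicts a per-token TF-IDF score

    def forward(self, input_ids):
        h = self.encoder(self.embed(input_ids))
        return self.mlm_head(h), self.tfidf_head(h).squeeze(-1)

def pretraining_loss(model, input_ids, mlm_labels, tfidf_targets, aux_weight=1.0):
    """MLM cross-entropy plus a weighted auxiliary regression loss.
    mlm_labels uses -100 at unmasked positions so they are ignored."""
    mlm_logits, tfidf_pred = model(input_ids)
    mlm_loss = nn.functional.cross_entropy(
        mlm_logits.view(-1, VOCAB), mlm_labels.view(-1), ignore_index=-100)
    aux_loss = nn.functional.mse_loss(tfidf_pred, tfidf_targets)
    return mlm_loss + aux_weight * aux_loss

# Toy usage with random data: "mask" every fourth position.
model = TinyEncoder()
ids = torch.randint(0, VOCAB, (2, 16))
labels = torch.full_like(ids, -100)
labels[:, ::4] = ids[:, ::4]            # only masked positions contribute to MLM loss
tfidf = torch.rand(2, 16)               # placeholder TF-IDF targets
loss = pretraining_loss(model, ids, labels, tfidf)
loss.backward()
```

Other auxiliary tasks from the abstract (sentence ordering, adjacent sentence prediction, the FastSent and Quick Thoughts variants) would slot in the same way: an extra head over the encoder output and an extra weighted term added to the summed loss.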