Paper Title
Do Fine-tuned Commonsense Language Models Really Generalize?
Paper Authors
Paper Abstract
Recently, transformer-based methods such as RoBERTa and GPT-3 have led to significant experimental advances in natural language processing tasks such as question answering and commonsense reasoning. The latter is typically evaluated through multiple benchmarks framed as multiple-choice instances of the former. According to influential leaderboards hosted by the Allen Institute (evaluating state-of-the-art performance on commonsense reasoning benchmarks), models based on such transformer methods are approaching human-like performance and have average accuracy well over 80% on many benchmarks. Since these are commonsense benchmarks, a model that generalizes on commonsense reasoning should not experience much performance loss across multiple commonsense benchmarks. In this paper, we study the generalization issue in detail by designing and conducting a rigorous scientific study. Using five common benchmarks, multiple controls and statistical analysis, we find clear evidence that fine-tuned commonsense language models still do not generalize well, even with moderate changes to the experimental setup, and may, in fact, be susceptible to dataset bias. We also perform selective studies, including qualitative and consistency analyses, to gain deeper insight into the problem.
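The cross-benchmark evaluation the abstract describes can be pictured as: fine-tune a multiple-choice model on one commonsense benchmark, then measure its accuracy on a different benchmark without further training. The sketch below illustrates this idea only; it is not the authors' actual pipeline. It assumes the HuggingFace transformers library, loads a generic roberta-large checkpoint as a stand-in for a fine-tuned model, and uses an illustrative (question, choices, label) data layout.

```python
# Minimal sketch (not the paper's pipeline): score multiple-choice
# commonsense instances with a RoBERTa model, so that a model fine-tuned
# on benchmark A can be evaluated zero-shot on benchmark B.
# Assumes HuggingFace `transformers`; data format below is illustrative.
import torch
from transformers import RobertaTokenizer, RobertaForMultipleChoice

# In a real study this would be the path to a checkpoint fine-tuned on
# one benchmark; here we load the generic pretrained weights as a stand-in.
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaForMultipleChoice.from_pretrained("roberta-large")
model.eval()

def predict(question, choices):
    # Encode (question, choice) pairs as a batch of size num_choices.
    enc = tokenizer([question] * len(choices), choices,
                    return_tensors="pt", padding=True, truncation=True)
    # RobertaForMultipleChoice expects (batch, num_choices, seq_len).
    inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, num_choices)
    return logits.argmax(dim=-1).item()

def accuracy(instances):
    # instances: iterable of (question, list_of_choices, gold_index) tuples
    # drawn from a held-out benchmark different from the fine-tuning one.
    correct = sum(predict(q, choices) == gold for q, choices, gold in instances)
    return correct / len(instances)
```

Comparing the accuracy on the held-out benchmark against in-benchmark accuracy is one simple way to quantify the generalization gap the paper investigates.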