Paper Title

Improving Temporal Generalization of Pre-trained Language Models with Lexical Semantic Change

Paper Authors

Zhaochen Su, Zecheng Tang, Xinyan Guan, Juntao Li, Lijun Wu, Min Zhang

Paper Abstract

Recent research has revealed that neural language models at scale suffer from poor temporal generalization capability, i.e., a language model pre-trained on static data from past years performs worse over time on emerging data. Existing methods mainly perform continual training to mitigate such a misalignment. While effective to some extent, this approach leaves the misalignment far from resolved on both language modeling and downstream tasks. In this paper, we empirically observe that temporal generalization is closely related to lexical semantic change, one of the essential phenomena of natural language. Based on this observation, we propose a simple yet effective lexical-level masking strategy to post-train a converged language model. Experiments on two pre-trained language models, two different classification tasks, and four benchmark datasets demonstrate the effectiveness of our proposed method over existing temporal adaptation methods, i.e., continual training with new data. Our code is available at \url{https://github.com/zhaochen0110/LMLM}.
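
The abstract describes the lexical-level masking strategy only at a high level. The Python sketch below illustrates one plausible form of such a strategy, assuming a precomputed set of words flagged as semantically changed, so that the masked language modeling (MLM) objective during post-training is concentrated on those words. The function name, example word list, and masking probabilities are illustrative assumptions, not the authors' LMLM implementation.

```python
# A minimal sketch (not the paper's code) of lexical-level masking:
# tokens whose lexical semantics are assumed to have changed over time
# are masked with a higher probability than ordinary tokens, biasing the
# MLM post-training signal toward temporally drifting words.
import random

MASK_TOKEN = "[MASK]"


def lexical_level_mask(tokens, changed_words, p_changed=0.5, p_base=0.15, seed=None):
    """Return (masked_tokens, labels).

    labels hold the original token at masked positions and None elsewhere,
    mirroring the usual MLM label convention.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        # Illustrative assumption: semantically changed words are masked more often.
        p = p_changed if tok.lower() in changed_words else p_base
        if rng.random() < p:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    # "corona" and "zoom" stand in for words flagged by a semantic-change detector.
    changed = {"corona", "zoom"}
    tokens = "we moved the meeting to zoom during the corona outbreak".split()
    masked, labels = lexical_level_mask(tokens, changed, seed=0)
    print(masked)
    print(labels)
```

In practice the masked sequences and labels would be converted to token IDs and fed to the usual MLM loss; the key design choice the abstract suggests is where to mask, not a new training objective.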
