Paper Title
A Compact Pretraining Approach for Neural Language Models
Paper Authors
Paper Abstract
Domain adaptation for large neural language models (NLMs) is coupled with massive amounts of unstructured data in the pretraining phase. In this study, however, we show that pretrained NLMs learn in-domain information more effectively and faster from a compact subset of the data that focuses on the key information in the domain. We construct these compact subsets from the unstructured data using a combination of abstractive summaries and extractive keywords. In particular, we rely on BART to generate abstractive summaries, and KeyBERT to extract keywords from these summaries (or the original unstructured text directly). We evaluate our approach using six different settings: three datasets combined with two distinct NLMs. Our results reveal that task-specific classifiers trained on top of NLMs pretrained using our method outperform those based on traditional pretraining, i.e., random masking on the entire data, as well as those without pretraining. Further, we show that our strategy reduces pretraining time by up to five times compared to vanilla pretraining. The code for all of our experiments is publicly available at https://github.com/shahriargolchin/compact-pretraining.
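Below is a minimal sketch of the summarize-then-extract pipeline the abstract describes, assuming BART is accessed through the Hugging Face `transformers` summarization pipeline (the `facebook/bart-large-cnn` checkpoint is an illustrative choice) and KeyBERT with its default embedding model. The helper name `build_compact_subset` and all parameter values are hypothetical and not taken from the paper's released code; see the linked repository for the authors' actual implementation.

```python
# Illustrative sketch: build a compact pretraining subset from unstructured
# in-domain documents via BART abstractive summaries + KeyBERT keywords.
from transformers import pipeline
from keybert import KeyBERT

# Assumed model choices; the paper's exact checkpoints may differ.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
kw_model = KeyBERT()


def build_compact_subset(documents, max_summary_tokens=128, top_n_keywords=10):
    """Return, per document, an abstractive summary plus extracted keywords."""
    compact = []
    for doc in documents:
        # Abstractive summary of the raw document (truncated to the model's limit).
        summary = summarizer(
            doc, max_length=max_summary_tokens, truncation=True
        )[0]["summary_text"]

        # Keywords extracted from the summary; the abstract notes they can
        # alternatively be extracted from the original text directly.
        keywords = kw_model.extract_keywords(
            summary, keyphrase_ngram_range=(1, 2), top_n=top_n_keywords
        )
        compact.append(
            {"summary": summary, "keywords": [kw for kw, _ in keywords]}
        )
    return compact


if __name__ == "__main__":
    docs = ["<an in-domain unstructured document>"]
    print(build_compact_subset(docs))
```

The resulting summaries and keywords would then serve as the compact corpus for continued (masked language model) pretraining of the target NLM, in place of the full unstructured data.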