Paper Title

Masked Autoencoders As The Unified Learners For Pre-Trained Sentence Representation

Paper Authors

Alexander Liu, Samuel Yang

Paper Abstract

Despite the progress on pre-trained language models, there is a lack of a unified framework for pre-trained sentence representation. As a result, different pre-training methods are required for specific scenarios, and the pre-trained models are likely to be limited in their universality and representation quality. In this work, we extend the recently proposed MAE-style pre-training strategy, RetroMAE, so that it can effectively support a wide variety of sentence representation tasks. The extended framework consists of two stages, with RetroMAE conducted throughout the process. The first stage performs RetroMAE over generic corpora, such as Wikipedia and BookCorpus, from which the base model is learned. The second stage takes place on domain-specific data, e.g., MS MARCO and NLI, where the base model is continually trained with RetroMAE and contrastive learning. The pre-training outputs at the two stages may serve different applications, whose effectiveness is verified with comprehensive experiments. Concretely, the base model proves effective for zero-shot retrieval, achieving remarkable performance on the BEIR benchmark. The continually pre-trained models further benefit more downstream tasks, including domain-specific dense retrieval on MS MARCO and Natural Questions, and the quality of sentence embeddings for the standard STS and transfer tasks in SentEval. The empirical insights of this work may inspire the future design of sentence representation pre-training. Our pre-trained models and source code will be released to the public community.
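
The abstract describes a second pre-training stage that combines the RetroMAE reconstruction objective with contrastive learning on domain-specific pairs. The sketch below is not the authors' released code; it only illustrates, under assumptions, how such a combined objective could look in PyTorch. The helper names, the temperature, and the mixing weight `alpha` are hypothetical, and the reconstruction loss is assumed to come from the MAE-style decoder described in the abstract.

```python
# Minimal sketch (assumptions noted above), not the authors' implementation.
import torch
import torch.nn.functional as F


def contrastive_loss(query_emb, passage_emb, temperature=0.05):
    """In-batch InfoNCE: the i-th query should match the i-th passage."""
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    logits = query_emb @ passage_emb.t() / temperature  # [B, B] similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)


def stage2_loss(reconstruction_loss, query_emb, passage_emb, alpha=1.0):
    """Combined objective for one continual pre-training step.

    `reconstruction_loss` is assumed to be the MAE-style masked-token loss
    conditioned on the sentence embedding; `alpha` is an illustrative knob
    balancing reconstruction against the contrastive term.
    """
    return reconstruction_loss + alpha * contrastive_loss(query_emb, passage_emb)
```

In such a setup, the in-batch pairs would plausibly come from the domain-specific data named in the abstract (query-passage pairs from MS MARCO, sentence pairs from NLI), though the exact batching and loss weighting are details the abstract does not specify.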
