Paper Title
RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models
Paper Authors
Paper Abstract
To better support retrieval applications such as web search and question answering, growing effort has been devoted to developing retrieval-oriented language models. Most existing works focus on improving the semantic representation capability of the contextualized embedding of the [CLS] token. However, recent studies show that the ordinary tokens besides [CLS] may provide extra information, which helps to produce better representations. As such, it is necessary to extend the current methods so that all contextualized embeddings can be jointly pre-trained for retrieval tasks. With this motivation, we propose a new pre-training method: the duplex masked auto-encoder, a.k.a. DupMAE, which aims to improve the semantic representation capacity of the contextualized embeddings of both the [CLS] token and the ordinary tokens. It introduces two decoding tasks: one is to reconstruct the original input sentence based on the [CLS] embedding; the other is to minimize the bag-of-words (BoW) loss of the input sentence based on the embeddings of all ordinary tokens. The two decoding losses are added up to train a unified encoding model. The embeddings from [CLS] and the ordinary tokens, after dimension reduction and aggregation, are concatenated as one unified semantic representation of the input. DupMAE is simple but empirically competitive: with a small decoding cost, it substantially improves the model's representation capability and transferability, achieving remarkable improvements on the MS MARCO and BEIR benchmarks.
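To make the duplex training concrete, below is a minimal PyTorch sketch of the two decoding objectives and the unified representation described in the abstract. It is an illustration under stated assumptions rather than the paper's implementation: the small stand-in Transformer encoder, the position-conditioned reconstruction head (a simplification of the RetroMAE-style one-layer masked decoder), the max-pooled BoW scoring, the mean-pooling aggregation of ordinary tokens, and all names (`DupMAESketch`, `reduced`, `bow_targets`) are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DupMAESketch(nn.Module):
    """Illustrative duplex objectives: [CLS] reconstruction + ordinary-token BoW."""

    def __init__(self, vocab_size=30522, hidden=768, max_len=512, reduced=384):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # stand-in encoder
        self.recon_head = nn.Linear(hidden, vocab_size)  # decoding from [CLS]
        self.bow_head = nn.Linear(hidden, vocab_size)    # per-token vocab logits
        self.cls_proj = nn.Linear(hidden, reduced)       # dimension reduction
        self.tok_proj = nn.Linear(hidden, reduced)

    def encode(self, input_ids):
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        return self.encoder(self.tok_emb(input_ids) + self.pos_emb(pos))

    def training_losses(self, input_ids, bow_targets):
        hidden = self.encode(input_ids)                  # [B, L, H]
        cls_emb, tok_emb = hidden[:, 0], hidden[:, 1:]
        # Task 1 (simplified): predict every input token from the [CLS]
        # embedding plus position embeddings; the paper instead uses a
        # RetroMAE-style one-layer masked decoder.
        pos = self.pos_emb(torch.arange(input_ids.size(1), device=input_ids.device))
        recon_logits = self.recon_head(cls_emb.unsqueeze(1) + pos)  # [B, L, V]
        loss_recon = F.cross_entropy(
            recon_logits.flatten(0, 1), input_ids.flatten())
        # Task 2: BoW loss from the ordinary tokens' embeddings -- max-pool
        # each token's vocabulary logits, then score the multi-hot BoW
        # target of the input sentence (pooling choice is an assumption).
        bow_logits = self.bow_head(tok_emb).max(dim=1).values       # [B, V]
        loss_bow = -(bow_targets * F.log_softmax(bow_logits, -1)).sum(-1).mean()
        return loss_recon + loss_bow   # the two losses are simply added up

    def represent(self, input_ids):
        hidden = self.encode(input_ids)
        cls_part = self.cls_proj(hidden[:, 0])           # reduced [CLS] embedding
        tok_part = self.tok_proj(hidden[:, 1:].mean(1))  # aggregated ordinary tokens
        return torch.cat([cls_part, tok_part], dim=-1)   # unified representation
```

In this sketch the BoW decoding amounts to one linear projection plus pooling, which illustrates the "small decoding cost" the abstract mentions, and `represent` concatenates the two reduced embeddings into one fixed-width vector suitable for dense retrieval.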