Paper Title
SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval
Paper Authors
Paper Abstract
In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, inspired by ELECTRA, to improve sample efficiency and reduce the input-distribution mismatch between pre-training and fine-tuning. SimLM only requires access to an unlabeled corpus and is therefore broadly applicable when no labeled data or queries are available. We conduct experiments on several large-scale passage retrieval datasets and show substantial improvements over strong baselines under various settings. Remarkably, SimLM even outperforms multi-vector approaches such as ColBERTv2, which incur significantly higher storage costs. Our code and model checkpoints are available at https://github.com/microsoft/unilm/tree/master/simlm.
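To make the bottleneck and replaced language modeling ideas from the abstract concrete, the snippet below is a minimal, hypothetical PyTorch sketch. The class name, layer counts, and hyperparameters are invented for illustration, and the ELECTRA-style generator that produces the replaced input tokens is omitted (the corrupted token ids are taken as given); this is not the authors' released implementation, only an illustration of how a deep encoder's [CLS] vector can act as the sole conduit through which a shallow decoder recovers the original tokens.

```python
import torch
import torch.nn as nn

class BottleneckPretrainer(nn.Module):
    """Hypothetical sketch of representation-bottleneck pre-training."""
    def __init__(self, vocab_size=30522, dim=768, enc_layers=12, dec_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        # Deep encoder reads the passage whose tokens were partially replaced
        # by an ELECTRA-style generator (generator not shown here).
        self.encoder = nn.TransformerEncoder(layer, num_layers=enc_layers)
        # Shallow decoder: too weak to reconstruct tokens on its own, so it
        # must rely on the single bottleneck vector produced by the encoder.
        self.decoder = nn.TransformerEncoder(layer, num_layers=dec_layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, replaced_ids, original_ids):
        # Encode the corrupted passage; position 0 is the [CLS] bottleneck.
        hidden = self.encoder(self.embed(replaced_ids))
        cls_vec = hidden[:, :1]  # (batch, 1, dim) dense passage vector
        # Decoder input: bottleneck vector + embeddings of the corrupted tokens.
        dec_in = torch.cat([cls_vec, self.embed(replaced_ids)[:, 1:]], dim=1)
        logits = self.lm_head(self.decoder(dec_in))
        # Replaced language modeling: predict the original token at each position.
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), original_ids.reshape(-1)
        )
```

In this sketch, the decoder sees the passage only through `cls_vec` plus the embeddings of the corrupted input, so minimizing the replaced-language-modeling loss pressures the encoder to pack passage information into that single dense vector, which is the kind of representation later used for retrieval fine-tuning.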