Paper Title

Retrieval Oriented Masking Pre-training Language Model for Dense Passage Retrieval

Paper Authors

Dingkun Long, Yanzhao Zhang, Guangwei Xu, Pengjun Xie

Paper Abstract

Pre-trained language models (PTMs) have been shown to yield powerful text representations for the dense passage retrieval task. Masked Language Modeling (MLM) is a major sub-task of the pre-training process. However, we find that the conventional random masking strategy tends to select a large number of tokens that have limited effect on the passage retrieval task (e.g., stop-words and punctuation). Noting that term importance weights can provide valuable information for passage retrieval, we propose an alternative retrieval-oriented masking (dubbed ROM) strategy in which more important tokens have a higher probability of being masked out, so that this straightforward yet essential information is captured to facilitate the language model pre-training process. Notably, the proposed token masking method does not change the architecture or learning objective of the original PTM. Our experiments verify that the proposed ROM enables term importance information to aid language model pre-training, thus achieving better performance on multiple passage retrieval benchmarks.
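
To make the idea concrete, the sketch below illustrates what an importance-weighted masking step could look like. It is not taken from the paper: the function name `retrieval_oriented_masking`, the use of Python, and the IDF-style importance weights are illustrative assumptions. The abstract only states that more important tokens receive a higher masking probability while the PTM's architecture and MLM objective stay unchanged.

```python
import random

def retrieval_oriented_masking(tokens, importance, mask_rate=0.15, mask_token="[MASK]"):
    """Sketch: mask positions with probability proportional to a term importance weight.

    Unlike uniform random masking, tokens with higher importance weights
    (assumed here to be IDF/BM25-style term weights) are more likely to be
    selected for masking.
    """
    n_mask = max(1, int(len(tokens) * mask_rate))
    total = sum(importance)
    probs = [w / total for w in importance]

    # Sample distinct positions, biased toward important tokens.
    positions = set()
    while len(positions) < n_mask:
        positions.add(random.choices(range(len(tokens)), weights=probs, k=1)[0])

    masked = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, sorted(positions)

# Example: stop-words and punctuation get low weights, content words high weights,
# so "eiffel", "tower", or "paris" are far more likely to be masked than "the" or ".".
tokens = ["the", "eiffel", "tower", "is", "located", "in", "paris", "."]
weights = [0.1, 3.0, 2.5, 0.1, 1.0, 0.1, 2.8, 0.05]
print(retrieval_oriented_masking(tokens, weights, mask_rate=0.3))
```

In this sketch only the choice of which positions to mask changes; the masked sequence would still be fed to a standard MLM objective, consistent with the claim that the architecture and learning objective of the original PTM are untouched.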
