Pre-training for Information Retrieval: Are Hyperlinks Fully Explored?

Paper Title

Pre-training for Information Retrieval: Are Hyperlinks Fully Explored?

Authors

Wu, Jiawen; Zhang, Xinyu; Zhu, Yutao; Liu, Zheng; Guo, Zikai; Fei, Zhaoye; Lai, Ruofei; Wu, Yongkang; Cao, Zhao; Dou, Zhicheng

Abstract

Recent years have witnessed great progress in applying pre-trained language models, e.g., BERT, to information retrieval (IR) tasks. Hyperlinks, which are commonly used in Web pages, have been leveraged for designing pre-training objectives. For example, the anchor texts of hyperlinks have been used to simulate queries, thus constructing large numbers of query-document pairs for pre-training. However, as a bridge across two Web pages, the potential of hyperlinks has not been fully explored. In this work, we focus on modeling the relationship between two documents that are connected by hyperlinks and design a new pre-training objective for ad-hoc retrieval. Specifically, we categorize the relationships between documents into four groups: no link, unidirectional link, symmetric link, and the most relevant symmetric link. By comparing two documents sampled from adjacent groups, the model can gradually improve its capability of capturing matching signals. We propose a progressive hyperlink prediction (PHP) framework to explore the utilization of hyperlinks in pre-training. Experimental results on two large-scale ad-hoc retrieval datasets and six question-answering datasets demonstrate its superiority over existing pre-training methods.
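The grouping and adjacent-group comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `links` dictionary, and the `most_relevant` mapping are hypothetical. The abstract does not say how the most relevant symmetric link is selected, so it is passed in as a given input. A link in either direction is treated as unidirectional here, which is also an assumption.

```python
from itertools import product

# Group labels taken from the abstract, ordered from weakest to strongest relation.
GROUPS = [
    "no link",
    "unidirectional link",
    "symmetric link",
    "most relevant symmetric link",
]


def relation_group(src, dst, links, most_relevant):
    """Return the index into GROUPS for candidate `dst` relative to `src`.

    links: dict mapping a document id to the set of ids it links to.
    most_relevant: dict mapping a document id to one symmetric neighbour
        (hypothetical input; the selection rule is not given in the abstract).
    """
    forward = dst in links.get(src, set())
    backward = src in links.get(dst, set())
    if forward and backward:
        return 3 if most_relevant.get(src) == dst else 2
    if forward or backward:
        # Assumption: a link in either direction counts as unidirectional.
        return 1
    return 0


def adjacent_group_pairs(src, candidates, links, most_relevant):
    """Return (weaker, stronger) candidate pairs drawn from adjacent groups."""
    buckets = {g: [] for g in range(len(GROUPS))}
    for doc in candidates:
        buckets[relation_group(src, doc, links, most_relevant)].append(doc)
    pairs = []
    for g in range(len(GROUPS) - 1):
        # Pair every document in group g with every document in group g + 1.
        pairs.extend(product(buckets[g], buckets[g + 1]))
    return pairs
```

Each returned pair could then feed a pairwise objective in which the model is asked to score the second document higher than the first for the same source document. The choice of loss is left open here because the abstract does not specify it.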
