Paper Title
Hyperlink-induced Pre-training for Passage Retrieval in Open-domain Question Answering
Paper Authors
Paper Abstract
To alleviate data scarcity in training question answering systems, recent work proposes additional intermediate pre-training for dense passage retrieval (DPR). However, a large discrepancy remains between the upstream signals this pre-training provides and the downstream question-passage relevance, which limits the improvement. To bridge this gap, we propose HyperLink-induced Pre-training (HLP), a method that pre-trains the dense retriever with the text relevance induced by hyperlink-based topology within Web documents. We demonstrate that the hyperlink-based structures of dual-link and co-mention provide effective relevance signals for large-scale pre-training that better facilitate downstream passage retrieval. We investigate the effectiveness of our approach across a wide range of open-domain QA datasets under zero-shot, few-shot, multi-hop, and out-of-domain scenarios. The experiments show that, under the zero-shot scenario, HLP outperforms BM25 by up to 7 points and other pre-training methods by more than 10 points in top-20 retrieval accuracy. Furthermore, HLP significantly outperforms other pre-training methods under the other scenarios.
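The abstract reports results as top-20 retrieval accuracy without defining the metric. The sketch below shows one common way to compute it for open-domain QA, under the assumption that a question counts as a hit when any of its top-k retrieved passages contains one of the gold answer strings (case-insensitive substring match). The function name `top_k_accuracy` and the input layout are hypothetical, and the paper's own evaluation script may normalize text or match answers differently.

```python
from typing import Sequence


def top_k_accuracy(
    retrieved: Sequence[Sequence[str]],
    answers: Sequence[Sequence[str]],
    k: int = 20,
) -> float:
    """Fraction of questions with a gold answer in at least one top-k passage.

    retrieved[i] holds the passages returned for question i, best first.
    answers[i] holds the acceptable answer strings for question i.
    Matching is a case-insensitive substring test (an assumption).
    """
    if len(retrieved) != len(answers):
        raise ValueError("retrieved and answers must have the same length")
    if not retrieved:
        return 0.0

    hits = 0
    for passages, golds in zip(retrieved, answers):
        # Only the k highest-ranked passages are considered.
        top_k = [p.lower() for p in passages[:k]]
        if any(g.lower() in p for g in golds for p in top_k):
            hits += 1
    return hits / len(retrieved)


if __name__ == "__main__":
    retrieved = [
        ["Paris is the capital of France.", "Lyon is a city in France."],
        ["Berlin is the capital of Germany."],
    ]
    answers = [["Paris"], ["Munich"]]
    print(top_k_accuracy(retrieved, answers, k=20))  # prints 0.5
```

Under this definition, a gain of 7 points means the retriever places an answer-bearing passage in the top 20 for 7 percentage points more of the questions.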