Paper title
Evaluating Interpolation and Extrapolation Performance of Neural Retrieval Models
Paper authors
Paper abstract
A retrieval model should not only interpolate the training data but also extrapolate well to queries that differ from the training data. While neural retrieval models have demonstrated impressive performance on ad hoc search benchmarks, we still know little about how they perform in terms of interpolation and extrapolation. In this paper, we demonstrate the importance of evaluating these two capabilities of neural retrieval models separately. First, we examine existing ad hoc search benchmarks from these two perspectives. We investigate the distribution of the training and test data and find considerable overlap in query entities, query intents, and relevance labels. This finding implies that evaluation on these test sets is biased toward interpolation and cannot accurately reflect extrapolation capacity. Second, we propose a novel evaluation protocol that evaluates interpolation and extrapolation performance separately on existing benchmark datasets. It resamples the training and test data based on query similarity and uses the resampled datasets for training and evaluation. Finally, we use the proposed evaluation protocol to comprehensively revisit a number of widely adopted neural retrieval models. The results show that models perform differently when moving from interpolation to extrapolation. For example, representation-based retrieval models perform almost as well as interaction-based retrieval models in terms of interpolation but not extrapolation. Therefore, it is necessary to evaluate interpolation and extrapolation performance separately, and the proposed resampling method serves as a simple yet effective evaluation tool for future IR studies.
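The abstract states only that the protocol resamples training and test data based on query similarity; it does not say how similarity is measured or how the split is made. The following is a minimal sketch of the general idea, not the authors' method. The token-level Jaccard similarity, the threshold of 0.5, and the names `jaccard`, `max_similarity`, and `resample_training_queries` are all assumptions introduced here for illustration.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed stand-in for query similarity)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def max_similarity(query: str, others: List[str]) -> float:
    """Highest similarity between `query` and any query in `others`."""
    return max((jaccard(query, other) for other in others), default=0.0)


def resample_training_queries(
    train: List[str], test: List[str], threshold: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Split training queries by their closeness to the test queries.

    Returns (similar, dissimilar): training queries whose best match in
    `test` reaches `threshold`, and those that stay below it.
    The threshold value is hypothetical, not taken from the paper.
    """
    similar: List[str] = []
    dissimilar: List[str] = []
    for query in train:
        if max_similarity(query, test) >= threshold:
            similar.append(query)
        else:
            dissimilar.append(query)
    return similar, dissimilar


# Toy data, invented purely for illustration.
train_queries = [
    "python list sort",
    "sort a python list",
    "best pizza in rome",
    "weather in paris",
]
test_queries = ["how to sort python list"]

similar, dissimilar = resample_training_queries(train_queries, test_queries)
print("similar to test:", similar)
print("dissimilar to test:", dissimilar)
```

Under these assumptions, training only on the `dissimilar` queries and evaluating on the test queries would approximate an extrapolation setting, while a training set that includes the `similar` queries would approximate an interpolation setting. A fair comparison would presumably also need the two training sets to be the same size.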