Paper title
CODEC: Complex Document and Entity Collection
Paper authors
Abstract
CODEC is a document and entity ranking benchmark that focuses on complex research topics. We target the essay-style information needs of social science researchers, e.g., "How has the UK's Open Banking Regulation benefited Challenger Banks?". CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations, including entity links. The resource includes expert judgments on 17,509 documents and entities (416.9 per topic) drawn from diverse automatic and interactive manual runs. The manual runs include 387 query reformulations, providing data for query performance prediction and for evaluating automatic query rewriting. CODEC also includes an analysis of state-of-the-art systems, including dense retrieval and neural re-ranking. The results show that the topics are challenging, with headroom for improvement in both document and entity ranking. Query expansion with entity information yields significant gains in document ranking, demonstrating the resource's value for evaluating and improving entity-oriented search. We also show that the manual query reformulations significantly improve both document ranking and entity ranking performance. Overall, CODEC provides challenging research topics to support the development and evaluation of entity-centric search methods.
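As a rough illustration of what "query expansion with entity information" can look like, the sketch below appends the entities most frequently linked in a set of top-ranked feedback documents to the original query. It is a minimal sketch under stated assumptions, not the method evaluated in the paper: the function name expand_query, the "entities" field, and the sample entity annotations are all hypothetical and invented for illustration.

```python
from collections import Counter


def expand_query(query, feedback_docs, num_entities=2):
    """Append the most frequently linked entities to the query.

    Hypothetical sketch: each feedback document is a dict whose
    "entities" list holds the names of its linked entities.
    """
    counts = Counter()
    for doc in feedback_docs:
        counts.update(doc["entities"])

    query_lower = query.lower()
    # Skip entities whose name is already mentioned in the query text.
    candidates = [
        name for name, _ in counts.most_common()
        if name.lower() not in query_lower
    ]
    return " ".join([query] + candidates[:num_entities])


# Topic text taken from the abstract; the entity links below are made up.
QUERY = "How has the UK's Open Banking Regulation benefited Challenger Banks?"
FEEDBACK_DOCS = [
    {"entities": ["Open banking", "Monzo", "Financial Conduct Authority"]},
    {"entities": ["Monzo", "Starling Bank"]},
    {"entities": ["Monzo", "Starling Bank", "Open banking"]},
]

expanded = expand_query(QUERY, FEEDBACK_DOCS)
print(expanded)
```

The expanded string would then be passed to any first-stage retriever in place of the original query. Ranking entities by raw frequency is only one simple choice; weighting them by document rank or retrieval score is an equally plausible variant.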