Paper Title

Cross-domain Retrieval in the Legal and Patent Domains: A Reproducibility Study

Authors

Sophia Althammer, Sebastian Hofstätter, Allan Hanbury

Abstract

Domain-specific search has always been a challenging information retrieval task because of the domain-specific language, the unique task settings, and the lack of accessible queries and corresponding relevance judgements. In recent years, pretrained language models such as BERT have revolutionized web and news search. Naturally, the community aims to adapt these advancements to the cross-domain transfer of retrieval models for domain-specific search. In the context of legal document retrieval, Shao et al. propose the BERT-PLI framework, which models Paragraph-Level Interactions with the language model BERT. In this paper we reproduce the original experiments, clarify the pre-processing steps, add missing scripts for framework steps, and investigate different evaluation approaches; however, we are not able to reproduce the evaluation results. Contrary to the original paper, we demonstrate that domain-specific paragraph-level modelling does not appear to help the performance of the BERT-PLI model compared to paragraph-level modelling with the original BERT. In addition to our legal search reproducibility study, we investigate BERT-PLI for document retrieval in the patent domain. We find that the BERT-PLI model does not yet achieve performance improvements for patent document retrieval compared to the BM25 baseline. Furthermore, we evaluate the individual components of the BERT-PLI model for cross-domain retrieval between the legal and patent domains, at both the paragraph level and the document level. We find that transferring the BERT-PLI model at the paragraph level leads to comparable results between the two domains, and we obtain first promising results for cross-domain transfer at the document level. For reproducibility and transparency, as well as to benefit the community, we make our source code and the trained models publicly available.
