Paper Title

Neural Approaches to Multilingual Information Retrieval

Paper Authors

Dawn Lawrie, Eugene Yang, Douglas W. Oard, James Mayfield

Paper Abstract

Providing access to information across languages has been a goal of Information Retrieval (IR) for decades. While progress has been made on Cross Language IR (CLIR) where queries are expressed in one language and documents in another, the multilingual (MLIR) task to create a single ranked list of documents across many languages is considerably more challenging. This paper investigates whether advances in neural document translation and pretrained multilingual neural language models enable improvements in the state of the art over earlier MLIR techniques. The results show that although combining neural document translation with neural ranking yields the best Mean Average Precision (MAP), 98% of that MAP score can be achieved with an 84% reduction in indexing time by using a pretrained XLM-R multilingual language model to index documents in their native language, and that 2% difference in effectiveness is not statistically significant. Key to achieving these results for MLIR is to fine-tune XLM-R using mixed-language batches from neural translations of MS MARCO passages.
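The abstract describes two key ingredients: encoding documents in their native language with a pretrained XLM-R model, and fine-tuning that model on mixed-language batches drawn from neural translations of MS MARCO passages. Below is a minimal sketch of both ideas using the HuggingFace `transformers` XLM-R checkpoint. The `translated_passages` store, the mean-pooling encoder, and the batch sampler are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Sketch (not the paper's code): dense encoding with XLM-R, plus
# mixed-language batch construction for fine-tuning.
import random
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def encode(texts):
    """Mean-pool XLM-R token embeddings into one dense vector per text.
    Mean pooling is one common choice; the paper may pool differently."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)           # (B, H)

# Hypothetical store of MS MARCO passages machine-translated into the
# collection languages (the paper obtains these via neural translation).
translated_passages = {
    "zh": ["...Chinese translation of passage 1...", "..."],
    "fa": ["...Persian translation of passage 1...", "..."],
    "ru": ["...Russian translation of passage 1...", "..."],
}

def mixed_language_batch(batch_size=8):
    """Sample each passage's language independently, so every
    fine-tuning batch mixes languages and the model sees the
    multilingual ranking condition at training time."""
    langs = random.choices(list(translated_passages), k=batch_size)
    return [random.choice(translated_passages[lang]) for lang in langs]
```

With a loss such as a standard contrastive ranking objective over query-passage pairs, batches drawn this way expose XLM-R to cross-language score comparisons, which is what a single ranked list over many languages requires.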
