Addressing Issues of Cross-Linguality in Open-Retrieval Question Answering Systems For Emergent Domains

Title

Addressing Issues of Cross-Linguality in Open-Retrieval Question Answering Systems For Emergent Domains

Authors

Alon Albalak, Sharon Levy, William Yang Wang

Abstract

Open-retrieval question answering systems are generally trained and tested on large datasets in well-established domains. However, low-resource settings such as new and emerging domains would especially benefit from reliable question answering systems. Furthermore, multilingual and cross-lingual resources in emergent domains are scarce, leading to few or no such systems. In this paper, we demonstrate a cross-lingual open-retrieval question answering system for the emergent domain of COVID-19. Our system adopts a corpus of scientific articles to ensure that retrieved documents are reliable. To address the scarcity of cross-lingual training data in emergent domains, we present a method utilizing automatic translation, alignment, and filtering to produce English-to-all datasets. We show that a deep semantic retriever greatly benefits from training on our English-to-all data and significantly outperforms a BM25 baseline in the cross-lingual setting. We illustrate the capabilities of our system with examples and release all code necessary to train and deploy such a system.
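The abstract compares a deep semantic retriever against a BM25 baseline. As a rough illustration of why a lexical baseline struggles across languages, the following is a minimal sketch of Okapi BM25 scoring. It is not the authors' implementation: the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, the whitespace tokenization, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus: two English documents and one French document.
docs = [
    "covid-19 vaccines reduce severe illness".split(),
    "masks limit airborne transmission".split(),
    "les vaccins réduisent les formes graves".split(),
]
query = "vaccines severe illness".split()
scores = bm25_scores(query, docs)
```

In this toy corpus the French document covers the same topic as the first English one but shares no tokens with the English query, so BM25 assigns it a score of zero. Exact-match scoring of this kind cannot bridge languages on its own, which is the gap a semantic retriever trained on cross-lingual data is meant to close.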
