Retrieval-efficiency trade-off of Unsupervised Keyword Extraction

Paper title

Retrieval-efficiency trade-off of Unsupervised Keyword Extraction

Authors

Blaž Škrlj, Boshko Koloski, Senja Pollak

Abstract

Efficiently identifying keyphrases that represent a given document is a challenging task. In recent years, a plethora of keyword detection approaches have been proposed. These approaches can be based on statistical (frequency-based) properties of, e.g., tokens, on specialized neural language models, or on a graph-based structure derived from a given document. Graph-based methods can be among the most computationally efficient while maintaining retrieval performance. One of the main properties common to graph-based methods is that they convert the token space into a graph immediately and only then process it further. In this paper, we explore a novel unsupervised approach that merges parts of a document in sequential form prior to construction of the token graph. Further, by leveraging personalized PageRank, which considers the frequencies of such sub-phrases alongside token lengths during node ranking, we demonstrate state-of-the-art retrieval capabilities while being up to two orders of magnitude faster than current state-of-the-art unsupervised detectors such as YAKE and MultiPartiteRank. The proposed method's scalability was also demonstrated by computing keyphrases for a biomedical corpus comprising 14 million documents in less than a minute.
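The ranking idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it skips the sequential merging of sub-phrases entirely and ranks single tokens only. The function names (`token_graph`, `personalized_pagerank`, `rank_keywords`), the co-occurrence window, the length-based token filter, and the prior of frequency times token length are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def token_graph(tokens, window=2):
    """Undirected weighted co-occurrence graph over tokens (assumed window size)."""
    edges = defaultdict(Counter)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if u != v:
                edges[u][v] += 1
                edges[v][u] += 1
    return edges


def personalized_pagerank(edges, prior, damping=0.85, iterations=50):
    """Power iteration whose teleport distribution is the normalized prior."""
    nodes = list(prior)
    total = sum(prior.values())
    teleport = {n: prior[n] / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new = {n: (1 - damping) * teleport[n] for n in nodes}
        for u in nodes:
            out_weight = sum(edges[u].values())
            if out_weight == 0:
                # Dangling node: hand its mass back via the teleport distribution.
                for n in nodes:
                    new[n] += damping * rank[u] * teleport[n]
            else:
                for v, w in edges[u].items():
                    new[v] += damping * rank[u] * w / out_weight
        rank = new
    return rank


def rank_keywords(text, top_k=5):
    """Rank tokens with a prior of frequency * length (an illustrative assumption)."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) > 3]
    freq = Counter(tokens)
    prior = {t: freq[t] * len(t) for t in freq}
    scores = personalized_pagerank(token_graph(tokens), prior)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


sample = "graph ranking uses graph structure; graph nodes carry ranking scores"
print(rank_keywords(sample, top_k=3))
```

The prior biases the random walk toward frequent and long tokens, which loosely mirrors the abstract's description of node ranking that considers frequencies alongside token lengths. A sketch closer to the described method would first merge adjacent tokens into sub-phrases and use those as graph nodes.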
