Paper title
GLEAKE: Global and Local Embedding Automatic Keyphrase Extraction
Paper authors
Paper abstract
Automated methods for granular categorization of large corpora of text documents have become increasingly important as the number of scientific, news, medical, and web documents has grown over the last few years. Automatic keyphrase extraction (AKE) aims to automatically detect a small set of single- or multi-word phrases within a single textual document that capture the document's main topics. AKE plays an important role in various NLP and information retrieval tasks, such as document summarization and categorization, full-text indexing, and article recommendation. Because sufficient human-labeled data is lacking across different kinds of textual content, supervised learning approaches are not ideal for automatically detecting keyphrases in document text. With recent advances in text embedding techniques, NLP researchers have focused on developing unsupervised methods to obtain meaningful insights from raw datasets. In this work, we introduce the Global and Local Embedding Automatic Keyphrase Extractor (GLEAKE) for the task of AKE. GLEAKE uses single- and multi-word embedding techniques to explore the syntactic and semantic aspects of candidate phrases and then combines them into a series of embedding-based graphs. Moreover, GLEAKE applies network analysis techniques to each embedding-based graph to select the most significant phrases as the final set of keyphrases. We demonstrate the high performance of GLEAKE by evaluating its results on five standard AKE datasets from different domains and writing styles and by showing its superiority over other state-of-the-art methods.
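The general idea in the abstract (embed candidate phrases, link them in an embedding-based graph, then rank them with a network measure) can be illustrated with a minimal sketch. This is not GLEAKE's actual implementation: the abstract does not say which embeddings, graph construction, or network analysis technique are used. The function names, the cosine-similarity threshold, and the choice of weighted PageRank below are all assumptions made for illustration, and the phrase vectors are expected to come from some embedding model that is not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_graph(embeddings, threshold=0.0):
    """Undirected weighted graph: an edge joins two phrases whose
    cosine similarity exceeds `threshold` (an assumed hyperparameter)."""
    phrases = list(embeddings)
    weights = {p: {} for p in phrases}
    for i, p in enumerate(phrases):
        for q in phrases[i + 1:]:
            sim = cosine(embeddings[p], embeddings[q])
            if sim > threshold:
                weights[p][q] = sim
                weights[q][p] = sim
    return weights


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration; stands in for the
    unspecified 'network analysis technique'."""
    nodes = list(weights)
    n = len(nodes)
    if n == 0:
        return {}
    scores = {p: 1.0 / n for p in nodes}
    out_sum = {p: sum(weights[p].values()) for p in nodes}
    for _ in range(iterations):
        # Score held by isolated nodes is spread evenly over all nodes.
        dangling = sum(scores[p] for p in nodes if out_sum[p] == 0.0)
        updated = {}
        for p in nodes:
            incoming = sum(
                scores[q] * w / out_sum[q] for q, w in weights[p].items()
            )
            updated[p] = (1.0 - damping) / n + damping * (incoming + dangling / n)
        scores = updated
    return scores


def top_keyphrases(embeddings, k=3, threshold=0.0):
    """Return the k highest-scoring (phrase, score) pairs."""
    scores = pagerank(build_graph(embeddings, threshold))
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

Calling `top_keyphrases(vectors, k=5, threshold=0.3)` on a dictionary that maps each candidate phrase to its vector returns the five phrases most central in the similarity graph. A phrase whose vector is dissimilar to all others ends up isolated and receives the lowest score, which is the intuition behind using graph centrality to filter off-topic candidates.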