Paper Title

Searching for Discriminative Words in Multidimensional Continuous Feature Space

Authors

Marius Sajgalik, Michal Barla, Maria Bielikova

Abstract

Word feature vectors have been proven to improve many NLP tasks. With recent advances in unsupervised learning of these feature vectors, it became possible to train them on much more data, which also results in better quality of the learned features. Since such training captures the joint probability of latent word features, it can be performed without any prior knowledge of the target task we want to solve. We aim to evaluate the universal applicability of feature vectors, which has already been shown to hold for many standard NLP tasks such as part-of-speech tagging or syntactic parsing. In our case, we want to capture the topical focus of text documents and design an efficient representation suitable for discriminating between different topics. This discriminativeness can be evaluated adequately on the text categorisation task. We propose a novel method to extract discriminative keywords from documents. We utilise word feature vectors to better understand the relations between words and to capture latent topics that are discussed in the text without being mentioned directly, but can be inferred logically. We also present a simple way to calculate document feature vectors from the extracted discriminative words. We evaluate our method on the four most popular datasets for text categorisation and show how different discriminative metrics influence the overall results. We demonstrate the effectiveness of our approach by achieving state-of-the-art results on text categorisation using just a small number of extracted keywords. We show that word feature vectors can substantially improve the topical inference of document meaning, and we conclude that the distributed representation of words can be used to build higher levels of abstraction, as we demonstrate by building feature vectors of documents.
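
The abstract mentions building document feature vectors from a small set of extracted discriminative keywords, but does not spell out the computation. Below is a minimal sketch of one plausible instantiation, assuming pretrained word embeddings (`word_vectors`), a generic discriminativeness score (a toy tf-idf stand-in here), and simple vector averaging; the function, parameter names, and toy data are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def document_vector(doc_tokens, word_vectors, score, k=10):
    """Build a document feature vector from its top-k discriminative words.

    doc_tokens   : list of tokens in the document
    word_vectors : dict mapping word -> np.ndarray (pretrained embeddings)
    score        : callable word -> float, a discriminativeness metric
                   (e.g. tf-idf or an embedding-based measure)
    k            : number of extracted keywords to keep
    """
    # Keep only words we have vectors for, ranked by the discriminative score.
    candidates = [w for w in set(doc_tokens) if w in word_vectors]
    keywords = sorted(candidates, key=score, reverse=True)[:k]
    if not keywords:
        # Fall back to a zero vector when nothing is scorable.
        dim = len(next(iter(word_vectors.values())))
        return np.zeros(dim), []
    # One simple aggregation: average the keyword vectors.
    doc_vec = np.mean([word_vectors[w] for w in keywords], axis=0)
    return doc_vec, keywords

# Usage with purely illustrative toy data.
vectors = {"football": np.array([0.9, 0.1]),
           "match": np.array([0.8, 0.2]),
           "the": np.array([0.5, 0.5])}
tfidf = {"football": 3.2, "match": 2.1, "the": 0.01}
vec, kws = document_vector(["the", "football", "match"], vectors,
                           lambda w: tfidf.get(w, 0.0), k=2)
print(kws, vec)
```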
