Paper Title

Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!

Paper Authors

Suzanna Sia, Ayush Dalmia, Sabrina J. Mielke

Paper Abstract

Topic models are a useful analysis tool to uncover the underlying themes within document collections. The dominant approach is to use probabilistic topic models that posit a generative story, but in this paper we propose an alternative way to obtain topics: clustering pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words. We provide benchmarks for the combination of different word embeddings and clustering algorithms, and analyse their performance under dimensionality reduction with PCA. The best performing combination for our approach performs as well as classical topic models, but with lower runtime and computational complexity.
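To make the approach described in the abstract concrete, here is a minimal sketch of the general idea: reduce pretrained word embeddings with PCA, cluster them, and take the words nearest each cluster centroid as that topic's top words. It assumes scikit-learn's PCA and KMeans and a user-supplied embedding matrix; the document-based weighting and the reranking step are simplified stand-ins, not the authors' exact implementation.

```python
# Sketch: topics from clustered pretrained word embeddings (assumed libraries: numpy, scikit-learn).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def embedding_topics(vocab, embeddings, word_weights=None, n_topics=20,
                     n_components=50, top_k=10, seed=0):
    """vocab: list of V words; embeddings: (V, D) array of pretrained vectors;
    word_weights: optional per-word weights derived from document statistics
    (e.g. corpus frequencies), used here for weighted clustering."""
    # Dimensionality reduction with PCA before clustering.
    reduced = PCA(n_components=min(n_components, embeddings.shape[1])).fit_transform(embeddings)
    # Weighted KMeans clustering of the reduced word vectors.
    km = KMeans(n_clusters=n_topics, random_state=seed, n_init=10)
    km.fit(reduced, sample_weight=word_weights)
    topics = []
    for c in range(n_topics):
        idx = np.where(km.labels_ == c)[0]
        # Rank (a stand-in for the paper's reranking) by distance to the centroid.
        dist = np.linalg.norm(reduced[idx] - km.cluster_centers_[c], axis=1)
        topics.append([vocab[i] for i in idx[np.argsort(dist)][:top_k]])
    return topics
```

In practice the embedding matrix would come from pretrained vectors such as word2vec, GloVe, or contextualized embeddings averaged per word, restricted to the corpus vocabulary; other clustering algorithms can be swapped in for KMeans to reproduce the benchmarked combinations.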
