Paper Title
Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings
Paper Authors
Paper Abstract
A topic model is often formulated as a generative model that explains how each word of a document is generated given a set of topics and document-specific topic proportions. It is focused on capturing the word co-occurrences in a document and hence often suffers from poor performance in analyzing short documents. In addition, its parameter estimation often relies on approximate posterior inference that is either not scalable or suffers from large approximation error. This paper introduces a new topic-modeling framework where each document is viewed as a set of word embedding vectors and each topic is modeled as an embedding vector in the same embedding space. Embedding the words and topics in the same vector space, we define a method to measure the semantic difference between the embedding vectors of the words of a document and those of the topics, and optimize the topic embeddings to minimize the expected difference over all documents. Experiments on text analysis demonstrate that the proposed method, which is amenable to mini-batch stochastic gradient descent based optimization and hence scalable to big corpora, provides competitive performance in discovering more coherent and diverse topics and extracting better document representations.
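To make the general idea concrete, the following is a minimal sketch, not the paper's actual cost or training procedure: it assumes a simple softmin-weighted cosine distance between each word embedding in a document and a shared set of topic embeddings, and optimizes only the topic embeddings with mini-batch stochastic gradient descent. The function name `doc_topic_cost`, the dimensions, and the toy corpus are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: vocabulary of 1000 words, 64-dim embeddings, 20 topics.
vocab_size, embed_dim, num_topics = 1000, 64, 20

# Word embeddings (could be pretrained and frozen); topic embeddings live in the same space.
word_emb = torch.randn(vocab_size, embed_dim)
topic_emb = torch.nn.Parameter(torch.randn(num_topics, embed_dim))

optimizer = torch.optim.Adam([topic_emb], lr=1e-2)

def doc_topic_cost(doc_word_ids):
    """Illustrative cost (an assumption, not the paper's definition): average
    softmin-weighted cosine distance from each word embedding in the document
    to the topic embeddings."""
    w = F.normalize(word_emb[doc_word_ids], dim=-1)   # (n_words, d)
    t = F.normalize(topic_emb, dim=-1)                # (K, d)
    dist = 1.0 - w @ t.T                              # cosine distances, (n_words, K)
    weights = F.softmax(-dist, dim=-1)                # soft assignment of words to topics
    return (weights * dist).sum(dim=-1).mean()

# Toy corpus: each "document" is a list of word indices.
corpus = [torch.randint(0, vocab_size, (30,)) for _ in range(200)]

for epoch in range(5):
    for i in range(0, len(corpus), 16):               # mini-batches of 16 documents
        batch = corpus[i:i + 16]
        loss = torch.stack([doc_topic_cost(d) for d in batch]).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Because the loss is an average over documents, each mini-batch gives an unbiased gradient estimate of the corpus-level objective, which is what makes this style of training scalable to large corpora.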