稀疏主题模型的最佳估计

论文标题

稀疏主题模型的最佳估计

Optimal estimation of sparse topic models

论文作者

Bing, Xin, Bunea, Florentina, Wegkamp, Marten

论文摘要

主题模型已成为缩小维度和探索性文本数据分析的流行工具，该工具由观察到的$ n $文档中$ p $单词的词汇频率组成，并存储在$ p \ times n $矩阵中。主要的前提是，该数据矩阵的平均值可以分解为两个非负矩阵的产物：a $ p \ times k $ word-tord-topic矩阵$ a $和a $ k \ times n $ topim-topiment opocument矩阵矩阵$ w $。本文研究了$ a $的估计，这可能是元素稀疏的，而$ k $的主题数量尚不清楚。在这种探索不足的上下文中，我们为估计此类$ a $的估计而得出了一个新的minimax下限，并为其恢复提出了一种新的计算有效算法。我们为我们的估计器提供了有限的样品上限，并表明它与许多情况下的最小下部界限匹配。我们的估计适应了$ a $的未知稀疏性，我们的分析对于任何有限的$ n $，$ p $，$ k $和文档长度都是有效的。合成数据和半合成数据的经验结果表明，我们提出的估计器是非sparse $ a $ a $ and稀疏$ a $的现有最新算法的有力竞争者，并且具有较高的性能是许多兴趣的情况。

Topic models have become popular tools for dimension reduction and exploratory analysis of text data which consists in observed frequencies of a vocabulary of $p$ words in $n$ documents, stored in a $p\times n$ matrix. The main premise is that the mean of this data matrix can be factorized into a product of two non-negative matrices: a $p\times K$ word-topic matrix $A$ and a $K\times n$ topic-document matrix $W$. This paper studies the estimation of $A$ that is possibly element-wise sparse, and the number of topics $K$ is unknown. In this under-explored context, we derive a new minimax lower bound for the estimation of such $A$ and propose a new computationally efficient algorithm for its recovery. We derive a finite sample upper bound for our estimator, and show that it matches the minimax lower bound in many scenarios. Our estimate adapts to the unknown sparsity of $A$ and our analysis is valid for any finite $n$, $p$, $K$ and document lengths. Empirical results on both synthetic data and semi-synthetic data show that our proposed estimator is a strong competitor of the existing state-of-the-art algorithms for both non-sparse $A$ and sparse $A$, and has superior performance is many scenarios of interest.

下载PDF全文

下载文献需遵守相关版权规定

论文标题