论文标题

稀疏主题模型的最佳估计

Optimal estimation of sparse topic models

论文作者

Bing, Xin, Bunea, Florentina, Wegkamp, Marten

论文摘要

主题模型已成为缩小维度和探索性文本数据分析的流行工具,该工具由观察到的$ n $文档中$ p $单词的词汇频率组成,并存储在$ p \ times n $矩阵中。主要的前提是,该数据矩阵的平均值可以分解为两个非负矩阵的产物:a $ p \ times k $ word-tord-topic矩阵$ a $和a $ k \ times n $ topim-topiment opocument矩阵矩阵$ w $。本文研究了$ a $的估计,这可能是元素稀疏的,而$ k $的主题数量尚不清楚。在这种探索不足的上下文中,我们为估计此类$ a $的估计而得出了一个新的minimax下限,并为其恢复提出了一种新的计算有效算法。我们为我们的估计器提供了有限的样品上限,并表明它与许多情况下的最小下部界限匹配。我们的估计适应了$ a $的未知稀疏性,我们的分析对于任何有限的$ n $,$ p $,$ k $和文档长度都是有效的。合成数据和半合成数据的经验结果表明,我们提出的估计器是非sparse $ a $ a $ and稀疏$ a $的现有最新算法的有力竞争者,并且具有较高的性能是许多兴趣的情况。

Topic models have become popular tools for dimension reduction and exploratory analysis of text data which consists in observed frequencies of a vocabulary of $p$ words in $n$ documents, stored in a $p\times n$ matrix. The main premise is that the mean of this data matrix can be factorized into a product of two non-negative matrices: a $p\times K$ word-topic matrix $A$ and a $K\times n$ topic-document matrix $W$. This paper studies the estimation of $A$ that is possibly element-wise sparse, and the number of topics $K$ is unknown. In this under-explored context, we derive a new minimax lower bound for the estimation of such $A$ and propose a new computationally efficient algorithm for its recovery. We derive a finite sample upper bound for our estimator, and show that it matches the minimax lower bound in many scenarios. Our estimate adapts to the unknown sparsity of $A$ and our analysis is valid for any finite $n$, $p$, $K$ and document lengths. Empirical results on both synthetic data and semi-synthetic data show that our proposed estimator is a strong competitor of the existing state-of-the-art algorithms for both non-sparse $A$ and sparse $A$, and has superior performance is many scenarios of interest.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源