Paper title
Sparse Correspondence Analysis for Contingency Tables
Paper authors
Paper abstract
Since the introduction of the lasso in regression, various sparse methods have been developed for the unsupervised setting, such as sparse principal component analysis (s-PCA), sparse canonical correlation analysis (s-CCA), and sparse singular value decomposition (s-SVD). These sparse methods combine feature selection and dimension reduction. One advantage of s-PCA is that it simplifies the interpretation of the (pseudo) principal components, since each one is expressed as a linear combination of a small number of variables. The disadvantages lie, on the one hand, in the difficulty of choosing the number of non-zero coefficients in the absence of a well-established criterion and, on the other hand, in the loss of orthogonality for the components and/or the loadings. In this paper we propose sparse variants of correspondence analysis (CA) for large contingency tables, such as the document-term matrices used in text mining, together with pPMD, a deflation technique derived from projected deflation in s-PCA. We use the fact that CA is a doubly weighted PCA (for rows and columns) or a weighted SVD, as well as a canonical correlation analysis of indicator variables. Applying s-CCA or s-SVD makes it possible to sparsify both row and column weights. The user may tune the level of sparsity of rows and columns and optimize it according to some criterion, and may even decide that no sparsity is needed for rows (or columns) by relaxing one sparsity constraint. The latter is equivalent to applying s-PCA to matrices of row (or column) profiles.
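The statement that CA is a weighted SVD can be illustrated with a short sketch. The code below is not the paper's algorithm and does not implement pPMD or any deflation step. It assumes the standard formulation of CA as the SVD of the matrix of standardized residuals, and it swaps in a generic soft-thresholded rank-one iteration to show how sparse row and column weights could arise. The function names (`standardized_residuals`, `soft_threshold`, `sparse_rank1`), the penalty parameters `lam_u` and `lam_v`, and the toy table are all hypothetical. The table is assumed to have no empty rows or columns.

```python
import numpy as np


def standardized_residuals(N):
    """Return the matrix whose ordinary SVD yields classical CA, plus the masses."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                 # correspondence matrix
    r = P.sum(axis=1)               # row masses
    c = P.sum(axis=0)               # column masses
    expected = np.outer(r, c)       # independence model
    S = (P - expected) / np.sqrt(expected)
    return S, r, c


def soft_threshold(x, lam):
    """Lasso-type shrinkage: entries with |x| <= lam become exactly zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)


def _normalize(x):
    norm = np.linalg.norm(x)
    if norm == 0.0:
        raise ValueError("penalty too large: every coefficient was set to zero")
    return x / norm


def sparse_rank1(S, lam_u=0.0, lam_v=0.0, n_iter=100):
    """Alternating soft-thresholded power iterations for one sparse axis.

    Illustrative assumption only: lam_u = 0 (or lam_v = 0) leaves the rows
    (or columns) unpenalized, mirroring the idea of relaxing one constraint.
    """
    # Start from the ordinary leading singular vectors.
    U, _, Vt = np.linalg.svd(S, full_matrices=False)
    u, v = U[:, 0], Vt[0]
    for _ in range(n_iter):
        u = _normalize(soft_threshold(S @ v, lam_u))
        v = _normalize(soft_threshold(S.T @ u, lam_v))
    d = float(u @ S @ v)            # pseudo singular value of the sparse axis
    return u, d, v


# Hypothetical 3 x 4 contingency table; the last column is proportional to the
# row margins, so it carries no association with the rows.
N = np.array([[30, 5, 5, 10],
              [5, 20, 15, 10],
              [5, 15, 20, 10]])
S, r, c = standardized_residuals(N)
u, d, v = sparse_rank1(S, lam_v=0.05)
print(np.count_nonzero(v), "non-zero column weights out of", v.size)
```

With both penalties set to zero the iteration reproduces the leading singular triplet of `S`, that is, the first axis of ordinary CA, and the sum of squared entries of `S` equals the chi-square statistic divided by the table total. Raising `lam_v` sets more column weights exactly to zero, which is the feature-selection effect described above. How the penalties should be chosen is left open here, as in the text.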