Paper title
Probabilistic Canonical Correlation Analysis for Sparse Count Data
Paper authors
Abstract
Canonical correlation analysis (CCA) is a classical and important multivariate technique for exploring the relationship between two sets of continuous variables. CCA has applications in many fields, such as genomics and neuroimaging. It can extract meaningful features and use them in subsequent analysis. Although some sparse CCA methods have been developed to deal with high-dimensional problems, they are designed specifically for continuous data and do not account for the integer-valued data from next-generation sequencing platforms, which exhibit very low counts for some important features. We propose a model-based probabilistic approach, PSCCA, for correlation and canonical correlation estimation for two sparse count data sets. PSCCA demonstrates that correlations and canonical correlations estimated at the natural parameter level are more appropriate than traditional estimates computed on the raw data. We demonstrate through simulation studies that PSCCA outperforms other standard correlation approaches and sparse CCA approaches in estimating the true correlations and canonical correlations at the natural parameter level. We further apply PSCCA to study the association between miRNA and mRNA expression data sets from a squamous cell lung cancer study, finding that PSCCA can uncover a larger number of strongly correlated pairs than standard correlation and other sparse CCA approaches.
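To illustrate why a correlation at the natural parameter level can differ from one computed on raw counts, the following is a minimal simulation sketch. It is not the PSCCA estimator. It assumes a Poisson model whose log-rates (the natural parameters of the Poisson distribution) are jointly Gaussian; the abstract does not specify this model, and the function name `simulate_counts` and the values of `rho`, `mu`, and `sigma` are hypothetical choices made only for illustration.

```python
import numpy as np


def simulate_counts(n, rho, mu=-1.0, sigma=1.0, seed=0):
    """Draw correlated log-rates and the sparse Poisson counts they generate.

    Assumed model (illustrative only, not taken from the paper):
    theta ~ N([mu, mu], sigma^2 * [[1, rho], [rho, 1]]), count ~ Poisson(exp(theta)).
    """
    rng = np.random.default_rng(seed)
    cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
    theta = rng.multivariate_normal([mu, mu], cov, size=n)  # natural parameters
    counts = rng.poisson(np.exp(theta))                     # observed sparse counts
    return theta, counts


theta, counts = simulate_counts(n=20000, rho=0.8)

# Correlation of the (normally unobserved) natural parameters.
latent_corr = np.corrcoef(theta[:, 0], theta[:, 1])[0, 1]
# Pearson correlation computed directly on the raw counts.
raw_corr = np.corrcoef(counts[:, 0], counts[:, 1])[0, 1]
# Share of zero entries, a simple measure of sparsity.
zero_fraction = np.mean(counts == 0)

print(f"latent correlation: {latent_corr:.2f}")
print(f"raw-count correlation: {raw_corr:.2f}")
print(f"fraction of zeros: {zero_fraction:.2f}")
```

Under these assumptions, the raw-count correlation comes out well below the correlation of the log-rates, because Poisson sampling noise inflates the variance of each count without adding to their covariance. A model-based method aims to recover the latent quantity rather than the attenuated one.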