Title
Separating and reintegrating latent variables to improve classification of genomic data
Authors
Abstract
Genomic datasets contain the effects of various unobserved biological variables in addition to the variable of primary interest. These latent variables often affect a large number of features (e.g., genes) and thus give rise to dense latent variation, which presents both challenges and opportunities for classification. Some of these latent variables may be partially correlated with the phenotype of interest and therefore helpful, while others may be uncorrelated and thus merely contribute additional noise. Moreover, whether potentially helpful or not, these latent variables may obscure weaker effects that impact only a small number of features but more directly capture the signal of primary interest. We propose the cross-residualization classifier to better account for the latent variables in genomic data. Through an adjustment and ensemble procedure, the cross-residualization classifier essentially estimates the latent variables and residualizes out their effects, trains a classifier on the residuals, and then re-integrates the latent variables in a final ensemble classifier. Thus, the latent variables are accounted for without discarding any potentially predictive information that they may contribute. We apply the method to simulated data as well as a variety of genomic datasets from multiple platforms. In general, we find that the cross-residualization classifier performs well relative to existing classifiers and sometimes offers substantial gains.
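To make the three steps named in the abstract concrete (estimate the latent variables, residualize, re-integrate in an ensemble), the following is a minimal illustrative sketch and not the authors' procedure. It rests on several assumptions that the abstract does not state: latent variables are approximated by the top principal components, both base classifiers are simple mean-difference linear scorers, the ensemble is an equal-weight sum of standardized scores, and any cross-fitting step the method may use is omitted. All function names (`fit_latent`, `split`, `fit_centroid`, `fit_ensemble`, `predict`) are hypothetical.

```python
import numpy as np

def fit_latent(X, k):
    """Assumption: estimate k latent directions as the top right singular
    vectors of the column-centered training matrix (plain PCA)."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k].T  # shapes: (p,), (p, k)

def split(X, mu, V):
    """Return latent scores Z and the residuals R left after projecting
    the estimated latent directions out of the centered data."""
    Xc = X - mu
    Z = Xc @ V
    R = Xc - Z @ V.T
    return Z, R

def fit_centroid(F, y):
    """Hypothetical base classifier: a linear scorer along the difference
    of class means, rescaled so its training scores have unit spread."""
    m0, m1 = F[y == 0].mean(axis=0), F[y == 1].mean(axis=0)
    w = m1 - m0
    b = -w @ (m0 + m1) / 2
    scale = np.std(F @ w + b) + 1e-12  # guard against division by zero
    return w / scale, b / scale

def score(F, model):
    w, b = model
    return F @ w + b

def fit_ensemble(X, y, k=2):
    """Fit one scorer on the latent scores and one on the residuals."""
    mu, V = fit_latent(X, k)
    Z, R = split(X, mu, V)
    return {"mu": mu, "V": V,
            "latent": fit_centroid(Z, y),
            "resid": fit_centroid(R, y)}

def predict(model, X):
    """Assumption: combine the two scorers with equal weights."""
    Z, R = split(X, model["mu"], model["V"])
    s = score(Z, model["latent"]) + score(R, model["resid"])
    return (s > 0).astype(int)
```

With a samples-by-features matrix `X` and binary labels `y` coded as 0 and 1, `model = fit_ensemble(X, y, k=2)` followed by `predict(model, X_new)` returns predicted labels. The point of the design is the one the abstract makes: the residual scorer can pick up sparse signal that dense latent variation would otherwise mask, while the latent scorer retains whatever predictive information the latent variables carry.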