Paper title
A variable selection approach for highly correlated predictors in high-dimensional genomic data
Paper authors
Paper abstract
In genomic studies, identifying biomarkers associated with a variable of interest is a major concern in biomedical research. Regularized approaches are classically used to perform variable selection in high-dimensional linear models. However, these methods can fail when the predictors are highly correlated. We propose a novel variable selection approach called WLasso that takes these correlations into account. It consists of rewriting the initial high-dimensional linear model to remove the correlation between the biomarkers (predictors) and then applying the generalized Lasso criterion. The performance of WLasso is assessed on synthetic data in several scenarios and compared with recent alternative approaches. The results show that when the biomarkers are highly correlated, WLasso outperforms the other approaches in sparse high-dimensional frameworks. The method is also successfully illustrated on publicly available gene expression data in breast cancer. Our method is implemented in the WLasso R package, which is available from the Comprehensive R Archive Network.
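The model-rewriting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the correlation is removed, so the sketch below assumes a standard whitening transform: the design matrix X is multiplied by the inverse square root of a known predictor covariance matrix Σ, and the coefficients are transformed accordingly so that the linear predictor is unchanged. Under this assumption, sparsity of the original coefficients β = Σ^{-1/2} β̃ would be encouraged by an ℓ1 penalty on Σ^{-1/2} β̃, which has the form of a generalized Lasso penalty. All names (`inverse_sqrt`, `whiten`, `sigma`, `rho`) are hypothetical and are not taken from the WLasso package; this is Python with NumPy, not the authors' R implementation.

```python
import numpy as np


def inverse_sqrt(sigma):
    """Symmetric inverse square root of a positive-definite matrix (illustrative helper)."""
    eigvals, eigvecs = np.linalg.eigh(sigma)
    # V diag(1 / sqrt(lambda)) V^T
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T


def whiten(X, sigma):
    """Return the decorrelated design X @ sigma^{-1/2} and the transform itself."""
    sigma_inv_sqrt = inverse_sqrt(sigma)
    return X @ sigma_inv_sqrt, sigma_inv_sqrt


# Assumed toy setting: equicorrelated predictors with a known covariance matrix.
rng = np.random.default_rng(0)
n, p, rho = 200, 5, 0.9
sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
beta = np.array([2.0, 0.0, 0.0, -1.0, 0.0])  # sparse coefficients (made up)

X_tilde, sigma_inv_sqrt = whiten(X, sigma)

# Transformed coefficients beta_tilde = sigma^{1/2} beta, obtained by solving
# sigma^{-1/2} beta_tilde = beta.
beta_tilde = np.linalg.solve(sigma_inv_sqrt, beta)

# The rewritten model has the same linear predictor as the original one.
fit_gap = np.max(np.abs(X @ beta - X_tilde @ beta_tilde))

# The whitened predictors have identity population covariance.
whitened_cov = sigma_inv_sqrt @ sigma @ sigma_inv_sqrt
```

In practice Σ would have to be estimated from the data, and the penalized criterion would be minimized with a generalized Lasso solver; both steps are outside this sketch.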