Paper title

Two-step penalised logistic regression for multi-omic data with an application to cardiometabolic syndrome

Authors

Alessandra Cabassi, Denis Seyres, Mattia Frontini, Paul D. W. Kirk

Abstract

Building classification models that predict a binary class label from high-dimensional multi-omic datasets poses several challenges, because the data layers typically differ widely in number of predictors, type of data, and level of noise. Previous research has shown that applying classical logistic regression with an elastic-net penalty to these datasets can lead to poor results (Liu et al., 2018). We implement a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately and a predictive model is then built using the variables selected in the first step. We compare our approach to other methods developed for the same purpose, and we adapt existing software for multi-omic linear regression (Zhao and Zucknick, 2020) to the logistic regression setting. Extensive simulation studies show that our approach should be preferred if the goal is to select as many relevant predictors as possible while achieving prediction performance comparable to that of the best competitors. Our motivating example is a cardiometabolic syndrome dataset comprising eight 'omic data types for two extreme phenotype groups (10 obese individuals and 10 individuals with lipodystrophy) and 185 blood donors. Our proposed approach allows us to identify features that characterise cardiometabolic syndrome at the molecular level. R code is available at https://github.com/acabassi/logistic-regression-for-multi-omic-data.
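
The two-step idea in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' R implementation: it assumes an elastic-net penalty in both steps, fits each model by plain proximal gradient descent, and uses fixed penalty strengths where a real analysis would tune them, for example by cross-validation. All names (`fit_elastic_net_logistic`, `two_step_fit`, `simulate_layers`) and the simulated two-layer dataset are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    # Proximal operator of the L1 penalty: shrink towards zero.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net_logistic(X, y, lam, alpha=0.5, step=0.5, n_iter=300):
    """Minimise mean log-loss + lam * (alpha * |b|_1 + (1 - alpha) / 2 * |b|_2^2).

    Assumes roughly standardised predictors. Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals (fitted probability minus label) at the current estimate.
        residuals = [
            sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - target
            for row, target in zip(X, y)
        ]
        intercept -= step * sum(residuals) / n  # the intercept is not penalised
        for j in range(p):
            grad = sum(r * row[j] for r, row in zip(residuals, X)) / n
            grad += lam * (1 - alpha) * beta[j]  # ridge part of the penalty
            beta[j] = soft_threshold(beta[j] - step * grad, step * lam * alpha)
    return intercept, beta


def two_step_fit(layers, y, lam_select, lam_final):
    """Step 1: select variables layer by layer. Step 2: refit on their union."""
    selected = []  # (layer index, column index) pairs kept by step 1
    for k, X in enumerate(layers):
        _, beta = fit_elastic_net_logistic(X, y, lam_select)
        selected += [(k, j) for j, b in enumerate(beta) if b != 0.0]
    X_final = [[layers[k][i][j] for k, j in selected] for i in range(len(y))]
    intercept, beta = fit_elastic_net_logistic(X_final, y, lam_final)
    return selected, intercept, beta


def simulate_layers(n=120, seed=0):
    """Toy data: two layers of Gaussian noise with one informative column each."""
    rng = random.Random(seed)
    layer_a = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n)]
    layer_b = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(n)]
    # Assumption: the label depends on column 0 of layer A and column 2 of layer B.
    y = [
        1 if 2 * a[0] + 2 * b[2] + rng.gauss(0, 1) > 0 else 0
        for a, b in zip(layer_a, layer_b)
    ]
    return [layer_a, layer_b], y


if __name__ == "__main__":
    layers, y = simulate_layers()
    selected, intercept, beta = two_step_fit(layers, y, lam_select=0.2, lam_final=0.05)
    for (k, j), b in zip(selected, beta):
        print(f"layer {k}, column {j}: coefficient {b:.3f}")
```

Selecting within each layer before fitting the joint model keeps a large or noisy layer from crowding out a small informative one, which is the difficulty with differing layer characteristics that the abstract describes. In this sketch a variable counts as selected whenever its step-1 coefficient is non-zero.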
