Analysis of ensemble feature selection for correlated high-dimensional RNA-Seq cancer data

Paper title

Analysis of ensemble feature selection for correlated high-dimensional RNA-Seq cancer data

Authors

Polewko-Klim, Aneta; Rudnicki, Witold R.

Abstract

The discovery of diagnostic and prognostic molecular markers is an important and actively pursued field in cancer research. For complex diseases, this process is often performed using machine learning. The current study compares two approaches to the discovery of relevant variables: applying a single feature selection algorithm versus applying an ensemble of diverse algorithms. These approaches are used to identify variables that are relevant for discerning four cancer types, using RNA-seq profiles from The Cancer Genome Atlas. The comparison is carried out in two respects: evaluating the predictive performance of the models and monitoring the stability of the selected variables. The most informative features are identified using four feature selection algorithms, namely the U-test, ReliefF, and two variants of the MDFS algorithm. Normal and tumor tissues are discerned using the Random Forest algorithm. The highest stability of the feature set was obtained when the U-test was used. Unfortunately, models built on feature sets obtained from the ensemble of feature selection algorithms were no better than models built on feature sets obtained from individual algorithms. On the other hand, the feature selector leading to the best classification results varied between data sets.
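
To make the two evaluation criteria more concrete, here is a minimal Python sketch of one of the four filters (a U-test ranking) together with a simple stability estimate. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the toy data, the subsampling scheme, and the use of the mean pairwise Jaccard index as a stability measure are all assumptions made for this example, and ReliefF, MDFS, the ensemble step, and the Random Forest classifier are not shown.

```python
import random
from itertools import combinations


def u_statistic(x, y):
    """Mann-Whitney U statistic of sample x versus sample y (ties count 0.5)."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def top_k_by_u_test(X, labels, k):
    """Return the indices of the k features whose U deviates most from n1*n2/2."""
    pos = [row for row, lab in zip(X, labels) if lab == 1]
    neg = [row for row, lab in zip(X, labels) if lab == 0]
    centre = len(pos) * len(neg) / 2.0  # expected U under the null hypothesis
    scores = []
    for j in range(len(X[0])):
        u = u_statistic([r[j] for r in pos], [r[j] for r in neg])
        scores.append((abs(u - centre), j))
    # Largest deviation first; ties are broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return {j for _, j in scores[:k]}


def jaccard(a, b):
    """Jaccard index of two sets (defined as 1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_stability(X, labels, k, n_rounds=10, frac=0.8, seed=0):
    """Mean pairwise Jaccard index of the top-k sets over random subsamples.

    Assumption: this is one common stability estimate, not necessarily
    the measure used in the paper.
    """
    rng = random.Random(seed)
    n = len(X)
    subsets = []
    for _ in range(n_rounds):
        idx = rng.sample(range(n), int(frac * n))
        subsets.append(
            top_k_by_u_test([X[i] for i in idx], [labels[i] for i in idx], k)
        )
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: feature 0 separates the two classes, features 1-3 are noise.
rng = random.Random(1)
labels = [0] * 20 + [1] * 20
X = [[rng.uniform(0, 1) + 2 * lab] + [rng.uniform(0, 1) for _ in range(3)]
     for lab in labels]

selected = top_k_by_u_test(X, labels, k=1)
stability = selection_stability(X, labels, k=1)
print(selected, stability)
```

In a comparison like the one the abstract describes, the same stability estimate could be computed for each filter and for an aggregated ranking, alongside the cross-validated performance of a classifier trained on the selected features.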
