Title
Ensemble feature selection with data-driven thresholding for Alzheimer's disease biomarker discovery
Authors
Abstract
Healthcare datasets present many challenges to both machine learning and statistics, as their data are typically heterogeneous, censored, high-dimensional and incomplete. Feature selection is often used to identify the important features, but it can produce unstable results when applied to high-dimensional data, selecting a different set of features on each iteration. The stability of feature selection can be improved by using feature selection ensembles, which aggregate the results of multiple base feature selectors. A threshold must then be applied to the aggregated feature set to separate the relevant features from the redundant ones. The fixed threshold that is typically applied offers no guarantee that the final set of selected features contains only relevant features. This work develops several data-driven thresholds that automatically identify the relevant features in an ensemble feature selector, and evaluates their predictive accuracy and stability. To demonstrate the applicability of these methods to clinical data, they are applied to data from two real-world Alzheimer's disease (AD) studies. AD is a progressive neurodegenerative disease with no known cure that begins at least 2-3 decades before overt symptoms appear, which gives researchers an opportunity to find early biomarkers that might identify patients at risk of developing AD. The features identified by applying these methods to both datasets reflect current findings in the AD literature.
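To make the idea of an ensemble with a data-driven threshold concrete, the following is a minimal sketch and not the method developed in the paper. The abstract does not specify the base selectors or the thresholds, so everything here is an assumption chosen for illustration: the base selector (absolute Pearson correlation), the bootstrap aggregation, the "largest gap" cut-off rule, the function names, and the synthetic data in which only features 0 and 1 are relevant.

```python
import random
import statistics


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def ensemble_scores(X, y, n_rounds=50, seed=0):
    """Average each feature's score over bootstrap resamples of the rows."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    totals = [0.0] * p
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        ys = [y[i] for i in idx]
        for j in range(p):
            totals[j] += abs_correlation([X[i][j] for i in idx], ys)
    return [t / n_rounds for t in totals]


def largest_gap_select(scores):
    """Hypothetical data-driven threshold: cut at the largest drop in sorted scores."""
    if len(scores) < 2:
        return list(range(len(scores)))
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    gaps = [scores[order[k]] - scores[order[k + 1]] for k in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    return sorted(order[: cut + 1])


# Synthetic data (assumed): 300 samples, 10 features, only features 0 and 1 drive y.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(300)]
y = [2 * row[0] + 2 * row[1] + rng.gauss(0, 0.5) for row in X]

scores = ensemble_scores(X, y)
selected = largest_gap_select(scores)
print(selected)
```

Unlike a fixed rule such as "keep the top k features", the cut-off here is derived from the aggregated scores themselves, so the number of selected features adapts to the data. In practice the base selectors, the aggregation, and the threshold would all be replaced by those evaluated in the paper.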