Paper Title
Elastic Net based Feature Ranking and Selection
Paper Authors
Abstract
Feature selection is important in data representation and intelligent diagnosis. Elastic net is one of the most widely used feature selectors. However, the features it selects depend on the training data, and the weights it assigns, which are fitted for regularized regression, do not reflect feature importance when used for feature ranking; both issues degrade model interpretability and extensibility. In this study, an intuitive idea is applied after repeated rounds of data splitting and elastic net based feature selection: the frequency with which each feature is selected is recorded and used as an indicator of feature importance. After the features are sorted by frequency, a linear support vector machine performs classification in an incremental manner. Finally, a compact subset of discriminative features is selected by comparing prediction performance. Experimental results on breast cancer data sets (BCDR-F03, WDBC, GSE 10810, and GSE 15852) suggest that the proposed framework achieves performance competitive with or superior to elastic net while consistently selecting fewer features. Future work will focus on further improving its consistency on high-dimension, small-sample-size data sets. The proposed framework is accessible online (https://github.com/NicoYuCN/elasticnetFR).
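The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it assumes scikit-learn, uses its bundled Wisconsin Diagnostic Breast Cancer data as a stand-in for WDBC, and the function names, number of splits, `alpha`, and `l1_ratio` values are all illustrative choices. For brevity it also scores the incremental step by cross-validation on the same data used for ranking, which a careful study would replace with a held-out test set.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def selection_frequency(X, y, n_splits=20, alpha=0.05, l1_ratio=0.5, seed=0):
    """Fraction of random splits in which each feature gets a non-zero weight."""
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=0.3, random_state=seed
    )
    counts = np.zeros(X.shape[1], dtype=int)
    for train_idx, _ in splitter.split(X, y):
        # Standardize so the penalty treats all features on the same scale.
        X_train = StandardScaler().fit_transform(X[train_idx])
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000)
        model.fit(X_train, y[train_idx])  # regression on 0/1 labels (assumption)
        counts += model.coef_ != 0
    return counts / n_splits


def incremental_selection(X, y, ranking, cv=5):
    """Grow the feature set in ranked order; keep the smallest best prefix."""
    clf = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
    scores = [
        cross_val_score(clf, X[:, ranking[:k]], y, cv=cv).mean()
        for k in range(1, len(ranking) + 1)
    ]
    best_k = int(np.argmax(scores)) + 1  # argmax returns the first maximum
    return ranking[:best_k], scores


X, y = load_breast_cancer(return_X_y=True)
freq = selection_frequency(X, y)
# Most frequently selected features first; ties keep their original order.
ranking = np.argsort(-freq, kind="stable")
selected, scores = incremental_selection(X, y, ranking)
print("selected feature indices:", selected.tolist())
print("best cross-validated accuracy:", round(max(scores), 3))
```

Ranking by selection frequency rather than by coefficient magnitude is the point of the sketch: a feature that survives the penalty in most splits is treated as important regardless of how large its regression weight happens to be in any single split.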