Paper Title

Feature Selection Methods for Cost-Constrained Classification in Random Forests

Authors

Rudolf Jagdhuber, Michel Lang, Jörg Rahnenführer

Abstract

Cost-sensitive feature selection describes a feature selection problem in which each feature incurs an individual cost for inclusion in a model. These costs make it possible to incorporate unfavorable aspects of features, e.g. the failure rate of a measuring device or patient harm, into the model selection process. Random Forests pose a particularly challenging problem for feature selection, as features are generally entangled in an ensemble of multiple trees, which makes post hoc removal of features infeasible. Feature selection methods therefore often either focus on simple pre-filtering or require many Random Forest evaluations along their optimization path, which drastically increases the computational complexity. To address both issues, we propose Shallow Tree Selection, a novel, fast, multivariate feature selection method that selects features from small tree structures. Additionally, we adapt three standard feature selection algorithms for cost-sensitive learning by introducing a hyperparameter-controlled benefit-cost ratio (BCR) criterion for each method. In an extensive simulation study, we assess this criterion and compare the proposed methods to multiple performance-based baseline alternatives on four artificial data settings and seven real-world data settings. We show that all methods using a hyperparameterized BCR criterion outperform the baseline alternatives. In a direct comparison between the proposed methods, each method shows strengths in certain settings, but no one-size-fits-all solution exists. On a global average, we could identify preferable choices among our BCR-based methods. Nevertheless, we conclude that a practical analysis should never rely on a single method only, but should always compare different approaches to obtain the best results.
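
The abstract does not spell out the form of the BCR criterion, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes that the criterion divides a per-feature benefit score (for example, a precomputed importance value) by the feature's cost raised to a hyperparameter, here called `xi`, and that features are picked greedily until a cost budget is exhausted. The function name `bcr_greedy_select`, the parameter `xi`, and the exponent form are all hypothetical.

```python
from typing import List, Sequence


def bcr_greedy_select(
    benefit: Sequence[float],
    cost: Sequence[float],
    budget: float,
    xi: float = 1.0,
) -> List[int]:
    """Greedily select feature indices by an assumed benefit-cost ratio.

    ASSUMPTION: the ratio is benefit[j] / cost[j] ** xi. The abstract only
    states that the criterion is hyperparameter-controlled; this exact form
    is illustrative. All costs must be strictly positive.
    """
    remaining = budget
    selected: List[int] = []
    candidates = list(range(len(benefit)))
    while candidates:
        # Only features that still fit into the remaining budget qualify.
        affordable = [j for j in candidates if cost[j] <= remaining]
        if not affordable:
            break
        # xi = 0 ignores cost in the ranking; larger xi penalizes cost more.
        best = max(affordable, key=lambda j: benefit[j] / cost[j] ** xi)
        selected.append(best)
        remaining -= cost[best]
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    benefit = [0.9, 0.5, 0.4]   # hypothetical per-feature benefit scores
    cost = [10.0, 2.0, 1.0]     # hypothetical per-feature costs
    print(bcr_greedy_select(benefit, cost, budget=10.0, xi=0.0))  # [0]
    print(bcr_greedy_select(benefit, cost, budget=10.0, xi=1.0))  # [2, 1]
```

With `xi = 0` the ranking uses benefit alone, so the single expensive feature consumes the whole budget. With `xi = 1` the two cheap features are chosen instead. This illustrates why such a hyperparameter would need tuning per data setting.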
