Paper title
A Computational Exploration of Emerging Methods of Variable Importance Estimation
Paper authors
Paper abstract
Estimating the importance of variables is an essential task in modern machine learning. It helps to evaluate how useful a feature is in a given model. Several techniques for estimating the importance of variables have been developed during the last decade. In this paper, we present a computational and theoretical exploration of emerging methods of variable importance estimation, namely the Least Absolute Shrinkage and Selection Operator (LASSO), Support Vector Machine (SVM), the Predictive Error Function (PERF), Random Forest (RF), and Extreme Gradient Boosting (XGBOOST), which were tested on different kinds of real-life and simulated data. All these methods handle both regression and classification tasks seamlessly, but all fail on data containing missing values. The implementation shows that PERF performs best on highly correlated data, closely followed by RF. PERF and XGBOOST are "data-hungry" methods: they had the worst performance on small data sizes, but they are the fastest in terms of execution time. SVM is the most appropriate when the dataset contains many redundant features. An added advantage of PERF is its natural cut-off at zero, which separates positive from negative scores: positive scores indicate essential and significant features, while negative scores indicate useless features. RF and LASSO are very versatile in that they can be used in almost all situations, even though they do not give the best results.