Paper Title
Loss-guided Stability Selection
Paper Authors
Paper Abstract
In modern data analysis, sparse model selection becomes inevitable once the number of predictor variables is very high. It is well known that model selection procedures like the Lasso or Boosting tend to overfit on real data. The celebrated Stability Selection overcomes these weaknesses by aggregating models fitted on subsamples of the training data and then choosing a stable predictor set, which is usually much sparser than the predictor sets of the raw models. Standard Stability Selection is based on a global criterion, namely the per-family error rate, and additionally requires expert knowledge to configure its hyperparameters suitably. Since model selection depends on the loss function, i.e., the predictor set selected w.r.t. one particular loss function differs from the set selected w.r.t. another, we propose a Stability Selection variant that respects the chosen loss function via an additional validation step on out-of-sample validation data, optionally enhanced with an exhaustive search strategy. Our Stability Selection variants are widely applicable and user-friendly. Moreover, they avoid the severe underfitting that affects the original Stability Selection on noisy high-dimensional data: our priority is not to avoid false positives at all costs but to obtain a sparse, stable model with which one can make predictions. Experiments covering both regression and binary classification, with Boosting as the model selection algorithm, reveal a significant improvement in precision compared to raw Boosting models while not suffering from any of the mentioned issues of the original Stability Selection.
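To make the described procedure concrete, the following is a minimal Python sketch of the two stages outlined in the abstract: subsample-based aggregation of selection frequencies, followed by the loss-guided validation step. It assumes a Lasso base learner as a stand-in for Boosting, squared-error loss as the chosen loss function, and a grid of frequency thresholds rather than the optional exhaustive search; all function and parameter names (e.g., loss_guided_stability_selection) are ours for illustration, not taken from the paper.

```python
# A minimal sketch of loss-guided Stability Selection (our illustration, not
# the paper's reference implementation). Lasso serves as the sparse base
# learner in place of Boosting; the chosen loss is squared error.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error


def loss_guided_stability_selection(X_train, y_train, X_val, y_val,
                                    n_subsamples=100, lasso_alpha=0.05,
                                    thresholds=None, loss=mean_squared_error,
                                    random_state=0):
    """Return the stable predictor set minimizing the out-of-sample loss."""
    rng = np.random.default_rng(random_state)
    n, p = X_train.shape
    counts = np.zeros(p)

    # Stage 1: aggregate selection frequencies over half-sample subsamples.
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        model = Lasso(alpha=lasso_alpha).fit(X_train[idx], y_train[idx])
        counts += (model.coef_ != 0)
    freq = counts / n_subsamples

    # Stage 2: loss-guided validation step. Instead of a fixed cutoff derived
    # from a per-family error rate bound, each candidate frequency threshold
    # defines a stable set; we refit on that set and keep the one whose
    # predictions minimize the chosen loss on the validation data.
    if thresholds is None:
        thresholds = np.arange(0.5, 1.0, 0.05)
    best_loss, best_set = np.inf, None
    for t in thresholds:
        stable = np.flatnonzero(freq >= t)
        if stable.size == 0:
            continue  # skip the empty model, which would severely underfit
        refit = LinearRegression().fit(X_train[:, stable], y_train)
        val_loss = loss(y_val, refit.predict(X_val[:, stable]))
        if val_loss < best_loss:
            best_loss, best_set = val_loss, stable
    return best_set, freq
```

An exhaustive-search variant, as mentioned in the abstract, would evaluate candidate subsets of the sufficiently frequent predictors in Stage 2 rather than only the threshold-induced sets; for binary classification, one would swap in a classification base learner and loss.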