Title
Sparse learning with CART
Authors
Abstract
Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For regression models, this approach recursively divides the data into two near-homogenous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along a particular variable. This paper aims to study the statistical properties of regression trees constructed with CART methodology. In doing so, we find that the training error is governed by the Pearson correlation between the optimal decision stump and response data in each node, which we bound by constructing a prior distribution on the split points and solving a nonlinear optimization problem. We leverage this connection between the training error and Pearson correlation to show that CART with cost-complexity pruning achieves an optimal complexity/goodness-of-fit tradeoff when the depth scales with the logarithm of the sample size. Data dependent quantities, which adapt to the dimensionality and latent structure of the regression model, are seen to govern the rates of convergence of the prediction error.
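The split-selection rule described in the abstract (choosing the split point along a variable that maximizes the reduction in sum-of-squares error, i.e. impurity) can be sketched as follows. This is a minimal one-variable illustration, not the paper's implementation; the function name `best_split` and the exhaustive midpoint search are illustrative assumptions:

```python
import numpy as np

def best_split(x, y):
    """Find the split point along a single variable that maximizes the
    reduction in sum-of-squares error (the impurity decrease)."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y)
    # Impurity of the parent node: sum of squared deviations from the mean.
    parent_sse = np.sum((y - y.mean()) ** 2)
    best_gain, best_s = -np.inf, None
    for i in range(1, n):
        left, right = y_sorted[:i], y_sorted[i:]
        # Impurity after splitting into two daughter nodes, each fit by its mean.
        child_sse = (np.sum((left - left.mean()) ** 2)
                     + np.sum((right - right.mean()) ** 2))
        gain = parent_sse - child_sse
        if gain > best_gain:
            # Place the candidate split midway between consecutive sorted values.
            best_gain = gain
            best_s = (x_sorted[i - 1] + x_sorted[i]) / 2
    return best_s, best_gain
```

Fitting each daughter node by its mean makes the resulting one-split predictor a decision stump; the impurity reduction achieved by the optimal stump is what the paper relates to the Pearson correlation between the stump and the response data in that node.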