Paper title
Modeling High-Dimensional Data with Unknown Cut Points: A Fusion Penalized Logistic Threshold Regression
Paper authors
Paper abstract
Traditional logistic regression models usually assume that the link function is linear and continuous in the predictors. Here, we consider a threshold model in which all continuous features are discretized into ordinal levels, which in turn determine the binary response. Both the threshold points and the regression coefficients are unknown and must be estimated. For high-dimensional data, we propose a fusion penalized logistic threshold regression (FILTER) model, in which a fused lasso penalty controls the total variation and shrinks coefficients to zero as a means of variable selection. Under mild conditions on the estimates of the unknown threshold points, we establish a non-asymptotic error bound for coefficient estimation as well as model selection consistency. Through a careful characterization of the error propagation, we also show that tree-based methods such as CART fulfill the threshold estimation conditions. Using physical examination data, we find that the FILTER model is well suited to early detection and prediction of chronic diseases such as diabetes. The finite-sample behavior of the proposed method is also explored and compared in extensive Monte Carlo studies, which support our theoretical findings.
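To make the structure of the model concrete, the following is a minimal sketch for a single continuous feature, not the authors' implementation. The abstract does not give the exact objective, so the form used here (average logistic loss, plus a total-variation term on adjacent level coefficients, plus an L1 term) is an assumption, as are all function and parameter names (`discretize`, `fused_penalty`, `objective`, `lam_tv`, `lam_sparse`). The cut points are taken as given; in the paper they would be estimated, for example by a tree-based method such as CART.

```python
import bisect
import math


def discretize(x, cuts):
    """Map a continuous value to an ordinal level in 0..len(cuts).

    `cuts` must be sorted in increasing order. A value equal to a cut
    point is assigned to the upper level (an arbitrary convention here).
    """
    return bisect.bisect_right(cuts, x)


def fused_penalty(beta, lam_tv, lam_sparse):
    """Assumed fused lasso penalty on the per-level coefficients.

    The first term penalizes differences between adjacent levels (total
    variation); the second shrinks individual coefficients toward zero.
    """
    tv = sum(abs(b1 - b0) for b0, b1 in zip(beta, beta[1:]))
    l1 = sum(abs(b) for b in beta)
    return lam_tv * tv + lam_sparse * l1


def objective(xs, ys, cuts, beta, intercept, lam_tv, lam_sparse):
    """Average logistic loss of the threshold model plus the penalty.

    `beta` holds one coefficient per ordinal level, so it must have
    len(cuts) + 1 entries. Labels in `ys` are 0 or 1.
    """
    nll = 0.0
    for x, y in zip(xs, ys):
        eta = intercept + beta[discretize(x, cuts)]
        # log(1 + exp(eta)) - y * eta, written in a numerically stable way
        nll += max(eta, 0.0) + math.log1p(math.exp(-abs(eta))) - y * eta
    return nll / len(xs) + fused_penalty(beta, lam_tv, lam_sparse)
```

Under this sketch, adjacent levels whose coefficients are fused to the same value behave as a single level, which is how the total-variation term can effectively remove an unnecessary cut point, while the L1 term can zero out a feature altogether.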