Two-Stage Robust and Sparse Distributed Statistical Inference for Large-Scale Data

Paper title

Two-Stage Robust and Sparse Distributed Statistical Inference for Large-Scale Data

Authors

Emadaldin Mozafari-Majd, Visa Koivunen

Abstract

In this paper, we address the problem of conducting statistical inference in settings involving large-scale data that may be high-dimensional and contaminated by outliers. The high volume and dimensionality of the data require distributed processing and storage solutions. We propose a two-stage distributed and robust statistical inference procedure that copes with high-dimensional models by promoting sparsity. In the first stage, known as model selection, relevant predictors are selected locally by applying robust Lasso estimators to distinct subsets of the data. The variable selections from the computation nodes are then fused by a voting scheme to find the sparse basis for the complete data set, which identifies the relevant variables in a robust manner. In the second stage, the developed statistically robust and computationally efficient bootstrap methods are employed. The actual inference constructs confidence intervals, finds parameter estimates, and quantifies standard deviations. As in the first stage, the results of local inference are communicated to the fusion center and combined there. Using analytical methods, we establish favorable statistical properties of the robust and computationally efficient bootstrap methods, including consistency for a fixed number of predictors and robustness. The proposed two-stage robust and distributed inference procedure demonstrates reliable performance and robustness in variable selection and in finding confidence intervals and bootstrap approximations of standard deviations, even when the data are high-dimensional and contaminated by outliers.
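
The fusion step of the first stage can be illustrated with a minimal sketch. The abstract says only that local variable selections are fused "by a voting scheme", so the majority rule, the function name `fuse_supports_by_vote`, and the `min_fraction` parameter below are assumptions made for illustration, not the authors' exact rule. Each node is assumed to report a boolean mask marking the predictors its local robust Lasso fit kept.

```python
import numpy as np


def fuse_supports_by_vote(local_supports, min_fraction=0.5):
    """Fuse per-node variable selections by voting (hypothetical rule).

    local_supports: array-like of shape (n_nodes, n_predictors); entry
        [k, j] is truthy if node k selected predictor j.
    min_fraction: a predictor is kept when strictly more than this
        fraction of nodes selected it (assumed threshold).
    Returns the sorted indices of the predictors that are kept.
    """
    votes = np.asarray(local_supports, dtype=bool)
    if votes.ndim != 2:
        raise ValueError("local_supports must be 2-D: (n_nodes, n_predictors)")
    # Fraction of nodes that voted for each predictor.
    share = votes.mean(axis=0)
    return np.flatnonzero(share > min_fraction)


# Three nodes, four predictors: predictor 0 gets 3 votes, predictor 1
# gets 2, predictor 2 gets 0, and predictor 3 gets 1.
supports = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
selected = fuse_supports_by_vote(supports)
```

With a strict majority threshold, a predictor picked by a single node (for example because of outliers in that node's subset) does not enter the fused support, which is one plausible way a voting step could add robustness.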
