Paper title
A model robust sub-sampling approach for Generalised Linear Models in Big data settings
Authors
Abstract
In the era of Big data, computationally efficient and scalable methods are needed to support timely insights and informed decision making. One such method is sub-sampling, where a subset of the Big data is analysed and used as the basis for inference rather than considering the whole data set. A key question when applying sub-sampling approaches is how to select an informative subset based on the questions being asked of the data. A recently proposed approach determines a sub-sampling probability for each data point, but its limitation is that appropriate sub-sampling probabilities rely on an assumed model for the Big data. In this article, to overcome this limitation, we propose a model robust approach where a set of models is considered, and the sub-sampling probabilities are evaluated as the weighted average of the probabilities that would be obtained if each model were considered individually. Theoretical support for such an approach is provided. Our model robust sub-sampling approach is applied in a simulation study and in two real-world applications, where its performance is compared to current sub-sampling practices. The results show that our model robust approach outperforms alternative approaches.
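The weighted-average idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the random "informativeness scores" standing in for per-model probabilities, the equal model weights, and sampling with replacement are all assumptions made for illustration, since the abstract does not specify how each model's probabilities or the weights are obtained.

```python
import numpy as np


def model_robust_probabilities(per_model_probs, model_weights):
    """Average per-model sub-sampling probabilities using model weights.

    per_model_probs: array of shape (Q, N); row q holds the sub-sampling
        probabilities that model q alone would assign to the N data points
        (each row must sum to 1).
    model_weights: length-Q non-negative weights that sum to 1.
    Both names are hypothetical, chosen for this sketch.
    """
    probs = np.asarray(per_model_probs, dtype=float)
    weights = np.asarray(model_weights, dtype=float)
    if probs.ndim != 2 or weights.shape != (probs.shape[0],):
        raise ValueError("expected (Q, N) probabilities and Q weights")
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("model weights must be non-negative and sum to 1")
    if np.any(probs < 0) or not np.allclose(probs.sum(axis=1), 1.0):
        raise ValueError("each model's probabilities must sum to 1")
    # A convex combination of probability vectors is a probability vector.
    return weights @ probs


def draw_subsample(probabilities, size, rng):
    """Draw row indices with the given probabilities.

    Sampling with replacement is an assumption of this sketch.
    """
    return rng.choice(len(probabilities), size=size, replace=True, p=probabilities)


rng = np.random.default_rng(0)

# Placeholder scores for two candidate models over 1000 data points.
# In practice these would come from a model-specific criterion.
scores = rng.random((2, 1000))
per_model = scores / scores.sum(axis=1, keepdims=True)

combined = model_robust_probabilities(per_model, [0.5, 0.5])
subsample_idx = draw_subsample(combined, size=100, rng=rng)
```

Because the model weights are non-negative and sum to one, the combined vector is itself a valid probability distribution over the data points, so it can be passed directly to a sampler without renormalising.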