Choosing good subsamples for regression modelling

Paper title

Choosing good subsamples for regression modelling

Authors

Thomas Lumley, Tong Chen

Abstract

A common problem in health research is that we have a large database with many variables measured on a large number of individuals. We are interested in measuring additional variables on a subsample; these measurements may be newly available, or expensive, or simply not considered when the data were first collected. The intended use for the new measurements is to fit a regression model generalisable to the whole cohort (and to its source population). This is a two-phase sampling problem; it differs from some other two-phase sampling problems in the richness of the phase I data and in the goal of regression modelling. In particular, an important special case is measurement-error models, where a variable strongly correlated with the phase II measurements is available at phase I. We will explain how influence functions have been useful as a unifying concept for extending classical results to this setting, and describe the steps from designing for a simple weighted estimator at known parameter values through adaptive multiwave designs and the use of prior information. We will conclude with some comments on the information gap between design-based and model-based estimators in this setting.
