Paper title
A method for comparing multiple imputation techniques: a case study on the U.S. National COVID Cohort Collaborative
Paper authors
Paper abstract
Healthcare datasets obtained from Electronic Health Records have proven extremely useful for assessing associations between patients' predictors and outcomes of interest. However, these datasets often have missing values in a high proportion of cases, and simply removing those cases may introduce severe bias. For these reasons, several multiple imputation algorithms have been proposed to recover the missing information. Each algorithm has strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, choosing each algorithm's parameters and making the data-related modelling choices is both crucial and challenging. In this paper, we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. The experiments presented here show that our approach could effectively highlight the most valid and performant missing-data handling strategy for our case study. Moreover, our methodology allowed us to understand the behavior of the different models and how that behavior changed as we modified their parameters. Our method is general and can be applied to different research fields and to datasets containing heterogeneous data types.
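The abstract does not say how estimates from the individual imputed datasets are combined. As a minimal sketch, assuming the standard pooling step known as Rubin's rules (an assumption, not something the abstract confirms), the per-imputation estimates and their variances could be pooled as follows. The function name `pool_rubin` and the input numbers are hypothetical and serve only as illustration.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their squared standard errors.

    Sketch of Rubin's rules; assumed here, not taken from the paper.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Made-up log-odds estimates from m = 3 imputed datasets (illustrative only).
estimate, total_var = pool_rubin([0.50, 0.60, 0.70], [0.04, 0.04, 0.04])
```

The total variance exceeds the average within-imputation variance whenever the imputed datasets disagree, which is how the extra uncertainty caused by the missing values enters the pooled result.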