Paper Title
Overly Optimistic Prediction Results on Imbalanced Data: a Case Study of Flaws and Benefits when Applying Over-sampling
Paper Authors
Paper Abstract
Information extracted from electrohysterography recordings could prove to be an interesting additional source of information for estimating the risk of preterm birth. Recently, a large number of studies have reported near-perfect results in distinguishing between recordings of patients who will deliver at term or preterm, using a public resource called the Term/Preterm Electrohysterogram database. However, we argue that these results are overly optimistic due to a methodological flaw. In this work, we focus on one specific type of methodological flaw: applying over-sampling before partitioning the data into mutually exclusive training and testing sets. We show how this biases the results using two artificial datasets, and we reproduce the results of studies in which this flaw was identified. Moreover, we evaluate the actual impact of over-sampling on predictive performance, when applied prior to data partitioning, using the same methodologies as the related studies, to provide a realistic view of these methodologies' generalization capabilities. We make our research reproducible by providing all the code under an open license.
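
The flaw described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it assumes plain random duplication as the over-sampling method (a stand-in for techniques such as SMOTE), a hand-written 1-nearest-neighbour classifier, and an artificial dataset whose features are pure noise. All function names and parameters are hypothetical. Because the labels cannot be predicted from the features, any apparent skill on the minority class comes from duplicated samples leaking into the test set.

```python
import random


def make_noise_data(n_major, n_minor, n_features, rng):
    """Build (features, label) pairs whose features are pure Gaussian noise."""
    labels = [0] * n_major + [1] * n_minor
    return [([rng.gauss(0, 1) for _ in range(n_features)], y) for y in labels]


def oversample(data, rng):
    """Random over-sampling: duplicate minority samples until classes balance."""
    minority = [s for s in data if s[1] == 1]
    n_major = len(data) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_major - len(minority))]
    return data + extra


def split(data, test_fraction, rng):
    """Shuffle a copy of the data and return (train, test)."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train, x):
    """Return the label of the training sample closest to x (squared distance)."""
    nearest = min(train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))
    return nearest[1]


def minority_recall(train, test):
    """Fraction of minority-class test samples that are predicted as minority."""
    minority_test = [s for s in test if s[1] == 1]
    hits = sum(predict_1nn(train, x) == 1 for x, _ in minority_test)
    return hits / max(1, len(minority_test))


rng = random.Random(0)
data = make_noise_data(n_major=360, n_minor=40, n_features=5, rng=rng)

# Flawed order: over-sample first, then split. Copies of the same minority
# sample end up on both sides of the split.
train, test = split(oversample(data, rng), 0.3, rng)
leaky_recall = minority_recall(train, test)

# Sound order: split first, then over-sample the training set only.
train, test = split(data, 0.3, rng)
sound_recall = minority_recall(oversample(train, rng), test)

print(f"minority recall, over-sampling before split: {leaky_recall:.2f}")
print(f"minority recall, over-sampling after split:  {sound_recall:.2f}")
```

In this sketch the only difference between the two runs is the order of `oversample` and `split`, so the gap between the two recall values is attributable to the leakage alone.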