Paper Title

Data-SUITE: Data-centric identification of in-distribution incongruous examples

Authors

Nabeel Seedat, Jonathan Crabbé, Mihaela van der Schaar

Abstract

Systematic quantification of data quality is critical for consistent model performance. Prior works have focused on out-of-distribution data. Instead, we tackle an understudied yet equally important problem of characterizing incongruous regions of in-distribution (ID) data, which may arise from feature space heterogeneity. To this end, we propose a paradigm shift with Data-SUITE: a data-centric AI framework to identify these regions, independent of a task-specific model. Data-SUITE leverages copula modeling, representation learning, and conformal prediction to build feature-wise confidence interval estimators based on a set of training instances. These estimators can be used to evaluate the congruence of test instances with respect to the training set, to answer two practically useful questions: (1) which test instances will be reliably predicted by a model trained with the training instances? and (2) can we identify incongruous regions of the feature space so that data owners understand the data's limitations or guide future data collection? We empirically validate Data-SUITE's performance and coverage guarantees and demonstrate on cross-site medical data, biased data, and data with concept drift, that Data-SUITE best identifies ID regions where a downstream model may be reliable (independent of said model). We also illustrate how these identified regions can provide insights into datasets and highlight their limitations.
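
To make the idea of feature-wise confidence interval estimators more concrete, the following is a minimal sketch of split conformal prediction applied per feature. It is not the authors' implementation: it assumes a plain least-squares predictor for each feature given the others, and it omits the copula modeling and representation learning stages that the abstract mentions. The function names (`fit_featurewise_conformal`, `featurewise_intervals`, `inconsistency_rate`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_featurewise_conformal(X_train, X_calib, alpha=0.1):
    """Fit one linear predictor per feature and calibrate its interval width.

    Assumption: each feature is predicted from the remaining features by
    ordinary least squares. Any other regressor could be swapped in.
    """
    n_calib, d = X_calib.shape
    models = []
    for j in range(d):
        others = [k for k in range(d) if k != j]
        # Least-squares fit of feature j from the other features plus an intercept.
        A = np.column_stack([X_train[:, others], np.ones(len(X_train))])
        coef, *_ = np.linalg.lstsq(A, X_train[:, j], rcond=None)
        # Nonconformity scores: absolute residuals on the held-out calibration split.
        A_cal = np.column_stack([X_calib[:, others], np.ones(n_calib)])
        scores = np.sort(np.abs(X_calib[:, j] - A_cal @ coef))
        # Finite-sample corrected rank used in split conformal prediction.
        k = int(np.ceil((n_calib + 1) * (1 - alpha)))
        q = np.inf if k > n_calib else scores[k - 1]
        models.append((others, coef, q))
    return models


def featurewise_intervals(models, X_test):
    """Return lower and upper interval bounds for every feature of every test row."""
    lower = np.empty(X_test.shape, dtype=float)
    upper = np.empty(X_test.shape, dtype=float)
    for j, (others, coef, q) in enumerate(models):
        A = np.column_stack([X_test[:, others], np.ones(len(X_test))])
        pred = A @ coef
        lower[:, j] = pred - q
        upper[:, j] = pred + q
    return lower, upper


def inconsistency_rate(X_test, lower, upper):
    """Fraction of features per test row whose observed value falls outside its interval."""
    outside = (X_test < lower) | (X_test > upper)
    return outside.mean(axis=1)
```

Under these assumptions, a test instance whose observed feature values frequently fall outside their intervals would be treated as incongruous with the training set, while instances with a low rate would be treated as regions where a downstream model is more likely to be reliable.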
