Paper Title
Have you forgotten? A method to assess if machine learning models have forgotten data
Paper Authors
Paper Abstract
In the era of deep learning, aggregation of data from several sources is a common approach to ensuring data diversity. Let us consider a scenario where several providers contribute data to a consortium for the joint development of a classification model (hereafter the target model), but now one of the providers decides to leave. This provider requests that their data (hereafter the query dataset) be removed from the databases, but also that the model 'forgets' their data. In this paper, for the first time, we want to address the challenging question of whether data have been forgotten by a model. We assume knowledge of the query dataset and the distribution of a model's output. We establish statistical methods that compare the target's outputs with outputs of models trained with different datasets. We evaluate our approach on several benchmark datasets (MNIST, CIFAR-10 and SVHN) and on a cardiac pathology diagnosis task using data from the Automated Cardiac Diagnosis Challenge (ACDC). We hope to encourage studies on what information a model retains and inspire extensions in more complex settings.
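The abstract does not spell out which statistical comparison is used, but the core idea — comparing the target model's output distribution on the query dataset against that of models trained with and without those data — can be illustrated with a two-sample Kolmogorov–Smirnov test. The sketch below is a hypothetical illustration, not the paper's actual method: the function name, the significance level, and the synthetic Beta-distributed "confidence scores" are all assumptions introduced here for demonstration.

```python
import numpy as np
from scipy.stats import ks_2samp

def compare_output_distributions(target_outputs, reference_outputs, alpha=0.05):
    """Two-sample Kolmogorov-Smirnov test on model output scores.

    Returns the KS statistic, the p-value, and a flag that is True when
    the two empirical distributions differ at significance level alpha
    (suggesting the models behave differently on the query dataset).
    """
    stat, p_value = ks_2samp(target_outputs, reference_outputs)
    return stat, p_value, bool(p_value < alpha)

# Synthetic example (hypothetical data): a model trained WITH the query
# data tends to be more confident on it than a model trained WITHOUT it.
rng = np.random.default_rng(0)
with_query = rng.beta(8, 2, size=500)     # confidences skewed toward 1
without_query = rng.beta(4, 4, size=500)  # confidences more spread out

stat, p, differ = compare_output_distributions(with_query, without_query)
```

In this toy setup the test flags the two output distributions as different; in the forgetting scenario, a target model whose outputs on the query dataset remain indistinguishable from a model trained with those data would suggest the data have not been forgotten.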