Paper title
Utility Assessment of Synthetic Data Generation Methods
Paper authors
Paper abstract
Big data analysis poses the dual problem of privacy preservation and utility, i.e., how accurate data analyses remain after the original data has been transformed to protect the privacy of the individuals the data describes, and whether they are accurate enough to be meaningful. In this paper, we therefore investigate, across several datasets, whether different methods of generating fully synthetic data vary in their utility a priori (when the specific analyses to be performed on the data are not yet known), how closely their results conform to analyses on the original data a posteriori, and whether these two effects are correlated. We find that some methods (decision-tree based) perform better than others across the board, that some choices of imputation parameters (notably the number of released datasets) have sizeable effects, that broad utility metrics show no correlation with analysis accuracy, and that correlations vary for narrow metrics. We did obtain promising results for classification tasks when using synthetic data to train machine learning models, which we consider worth exploring further, including with regard to mitigating privacy attacks against ML models such as membership inference and model inversion.