Paper Title
Synthetic Dataset Generation for Privacy-Preserving Machine Learning
Paper Authors
Paper Abstract
Machine Learning (ML) has achieved enormous success in solving a variety of problems in computer vision, speech recognition, and object detection, to name a few. The principal reason for this success is the availability of huge datasets for training deep neural networks (DNNs). However, datasets cannot be publicly released if they contain sensitive information such as medical or financial records. In such cases, data privacy becomes a major concern. Encryption methods offer a possible solution to this issue; however, their deployment in ML applications is non-trivial, as they seriously impact classification accuracy and result in substantial computational overhead. Alternatively, obfuscation techniques can be used, but maintaining a good balance between visual privacy and accuracy is challenging. In this work, we propose a method to generate secure synthetic datasets from the original private datasets. In our method, given a network with Batch Normalization (BN) layers pre-trained on the original dataset, we first record the layer-wise BN statistics. Next, using the BN statistics and the pre-trained model, we generate the synthetic dataset by optimizing random noise such that the synthetic data match the layer-wise statistical distribution of the original model. We evaluate our method on an image classification dataset (CIFAR10) and show that our synthetic data can be used for training networks from scratch, producing reasonable classification performance.
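The core idea of the abstract, optimizing random noise until its layer statistics match recorded BN statistics, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the loss, the optimizer, or the network, so the example below assumes a single fixed linear layer standing in for the pre-trained model, a squared-error loss on the batch mean and variance, and plain gradient descent. All names and values (`WEIGHTS`, `TARGET_MEAN`, `TARGET_VAR`, `make_noise`, `synthesize`) are hypothetical.

```python
import random

# Hypothetical stand-ins: one frozen linear layer (3 outputs, 4 inputs) and the
# BN running statistics that would have been recorded for its outputs.
WEIGHTS = [[0.5, -0.2, 0.1, 0.0],
           [0.0, 0.3, -0.4, 0.2],
           [0.1, 0.1, 0.2, -0.5]]
TARGET_MEAN = [0.2, -0.1, 0.3]
TARGET_VAR = [0.5, 0.4, 0.6]


def make_noise(n_samples, d_in, seed=0):
    """Draw the initial synthetic batch from standard Gaussian noise."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(n_samples)]


def forward(batch, weights):
    """Toy layer: z = W x for every sample in the batch."""
    return [[sum(w * x for w, x in zip(w_row, sample)) for w_row in weights]
            for sample in batch]


def column_stats(rows):
    """Per-feature mean and biased variance over the batch."""
    n = len(rows)
    dims = range(len(rows[0]))
    means = [sum(r[j] for r in rows) / n for j in dims]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in dims]
    return means, variances


def bn_matching_loss(batch, weights, target_mean, target_var):
    """Squared distance between batch statistics and the recorded statistics."""
    means, variances = column_stats(forward(batch, weights))
    mean_term = sum((m - t) ** 2 for m, t in zip(means, target_mean))
    var_term = sum((v - t) ** 2 for v, t in zip(variances, target_var))
    return mean_term + var_term


def synthesize(noise, weights, target_mean, target_var, steps=500, lr=5.0):
    """Gradient descent on the inputs only; the layer weights stay frozen."""
    batch = [list(sample) for sample in noise]  # leave the caller's noise intact
    n, d_out, d_in = len(batch), len(weights), len(weights[0])
    for _ in range(steps):
        z = forward(batch, weights)
        means, variances = column_stats(z)
        for i in range(n):
            # Derivative of the loss with respect to z[i][j] (mean + variance terms).
            grad_z = [2.0 * (means[j] - target_mean[j]) / n
                      + 4.0 * (variances[j] - target_var[j]) * (z[i][j] - means[j]) / n
                      for j in range(d_out)]
            # Chain rule through z = W x gives dLoss/dx = W^T grad_z.
            for k in range(d_in):
                batch[i][k] -= lr * sum(grad_z[j] * weights[j][k] for j in range(d_out))
    return batch


if __name__ == "__main__":
    noise = make_noise(64, 4)
    synthetic = synthesize(noise, WEIGHTS, TARGET_MEAN, TARGET_VAR)
    print("loss before:", bn_matching_loss(noise, WEIGHTS, TARGET_MEAN, TARGET_VAR))
    print("loss after: ", bn_matching_loss(synthetic, WEIGHTS, TARGET_MEAN, TARGET_VAR))
```

In a real setting the frozen layer would be replaced by the full pre-trained DNN, the statistics would be read from each BN layer, and the gradients would come from an automatic differentiation framework rather than the hand-derived expression used here.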