Paper Title

FedSyn: Synthetic Data Generation using Federated Learning

Authors

Behera, Monik Raj; Upadhyay, Sudhir; Shetty, Suresh; Priyadarshini, Sudha; Patel, Palka; Lee, Ker Farn

Abstract

As deep learning algorithms continue to evolve and become more sophisticated, they require massive datasets for model training and for model efficacy. Some of these data requirements can be met with existing datasets within organizations. Current machine learning practices can be leveraged to generate synthetic data from an existing dataset. Further, it is well established that the diversity of generated synthetic data relies on (and is perhaps limited by) the statistical properties of the dataset available within a single organization or entity. The more diverse an existing dataset is, the more expressive and generic the synthetic data can be. However, given the scarcity of underlying data, it is challenging to collate big data in one organization. The diverse, non-overlapping datasets held by distinct organizations give them an opportunity to contribute their limited, distinct data to a larger pool that can be leveraged for further synthesis. Unfortunately, this raises data privacy concerns that some institutions may not be comfortable with. This paper proposes a novel approach to generating synthetic data: FedSyn. FedSyn is a collaborative, privacy-preserving approach to generating synthetic data among multiple participants in a federated and collaborative network. FedSyn creates a synthetic data generation model that can generate synthetic data reflecting the statistical distributions of almost all the participants in the network. FedSyn does not require access to the data of an individual participant, hence protecting the privacy of participants' data. The proposed technique leverages federated machine learning and a generative adversarial network (GAN) as the neural network architecture for synthetic data generation. The proposed method can be extended to many machine learning problem classes in finance, health, governance, technology, and many more.
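
The abstract does not say how participant models are combined, so the following is only a minimal sketch of one common federated aggregation step (FedAvg-style weighted averaging), not the authors' method. It assumes each participant trains a generator locally and shares only a flat parameter vector plus its local sample count; the function name `federated_average` and the example numbers are hypothetical.

```python
from typing import List, Sequence


def federated_average(
    local_weights: Sequence[Sequence[float]],
    sample_counts: Sequence[int],
) -> List[float]:
    """Average per-participant parameter vectors, weighted by sample count.

    Assumption: every participant shares model parameters only, never raw
    data. This is a generic FedAvg-style step, not FedSyn's exact protocol.
    """
    if not local_weights or len(local_weights) != len(sample_counts):
        raise ValueError("need one sample count per participant")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(local_weights[0])
    if any(len(w) != dim for w in local_weights):
        raise ValueError("all parameter vectors must have the same length")
    # Coordinate-wise weighted mean of the participants' parameters.
    return [
        sum(w[i] * n for w, n in zip(local_weights, sample_counts)) / total
        for i in range(dim)
    ]


# Hypothetical example: two participants with 1 and 3 local samples.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

In a full system the averaged vector would be loaded back into each participant's generator before the next round of local GAN training; any privacy mechanisms applied to the shared parameters are outside the scope of this sketch.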
