Paper title
CTAB-GAN+: Enhancing Tabular Data Synthesis
Paper authors
Paper abstract
While data sharing is crucial for knowledge development, privacy concerns and strict regulation (e.g., the European General Data Protection Regulation (GDPR)) limit its full effectiveness. Synthetic tabular data emerges as an alternative that enables data sharing while fulfilling regulatory and privacy constraints. State-of-the-art tabular data synthesizers draw their methodologies from Generative Adversarial Networks (GANs). As GANs improve, the synthesized data increasingly resemble the real data, which risks leaking private information. Differential privacy (DP) provides theoretical guarantees on privacy loss but degrades data utility. Striking the best trade-off remains a challenging research question. We propose CTAB-GAN+, a novel conditional tabular GAN. CTAB-GAN+ improves upon the state of the art by (i) adding downstream losses to conditional GANs for higher-utility synthetic data in both classification and regression domains; (ii) using Wasserstein loss with gradient penalty for better training convergence; (iii) introducing novel encoders targeting mixed continuous-categorical variables and variables with unbalanced or skewed data; and (iv) training with DP stochastic gradient descent to impose strict privacy guarantees. We extensively evaluate CTAB-GAN+ on data similarity and analysis utility against state-of-the-art tabular GANs. The results show that CTAB-GAN+ synthesizes privacy-preserving data with at least 48.16% higher utility across multiple datasets and learning tasks under different privacy budgets.
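Point (iv) of the abstract refers to DP stochastic gradient descent. As a rough illustration of the general technique (not the authors' implementation), the sketch below shows the core aggregation step usually associated with DP-SGD: each per-example gradient is clipped to a fixed L2 norm, the clipped gradients are summed, Gaussian noise scaled to the clipping norm is added, and the result is averaged. The function names `clip_gradient` and `dp_sgd_aggregate` and the parameters `max_norm` and `noise_multiplier` are hypothetical and chosen only for this example; privacy accounting is omitted.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_aggregate(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip per-example gradients, sum them, add Gaussian noise, then average.

    Assumption for this sketch: gradients are plain lists of floats and the
    noise standard deviation is noise_multiplier * max_norm.
    """
    batch_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / batch_size for n in noisy]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]
    # A larger noise_multiplier gives stronger privacy but a noisier update.
    print(dp_sgd_aggregate(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single record can influence the update, and the noise masks what influence remains. This is the source of the privacy-utility trade-off the abstract describes: more noise means a smaller privacy budget and a less accurate gradient.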