Generating Synthetic Data with Locally Estimated Distributions for Disclosure Control

Paper Title

Generating Synthetic Data with Locally Estimated Distributions for Disclosure Control

Author

Kalay, Ali Furkan

Abstract

Sensitive datasets are often underutilized in research and industry due to privacy concerns, limiting the potential of valuable data-driven insights. Synthetic data generation presents a promising solution to address this challenge by balancing privacy protection with data utility. This paper introduces a new approach to mitigate privacy risks associated with outlier observations in synthetic datasets: the Local Resampler (LR). The LR leverages the $k$-nearest neighbors algorithm to generate synthetic data while minimizing disclosure risks by underrepresenting outliers, even when they are not detectable in marginal distributions. Theoretical and empirical analyses demonstrate that the LR effectively mitigates outlier-driven disclosure risks, and accurately replicates multimodal, skewed, and non-convex support distributions. The semiparametric nature of the LR ensures a low computational burden and works efficiently even with small samples. By parameterizing the balance between privacy risks and data utility, this approach promotes broader access to sensitive datasets for research.
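
The abstract does not spell out the algorithm, so the following is only a rough illustration of the general idea of resampling from locally estimated distributions, not the paper's Local Resampler. It assumes, purely for illustration, that each synthetic row is drawn from a Gaussian fitted to the $k$ nearest neighbors of a randomly chosen observation; the function name `local_resample_sketch` and its parameters are hypothetical.

```python
import numpy as np


def local_resample_sketch(X, k=10, n_synth=None, seed=0):
    """Illustrative sketch only (NOT the paper's algorithm).

    Assumption: each synthetic row is drawn from a Gaussian whose mean and
    covariance are estimated from the k nearest neighbors of a randomly
    chosen original row. Requires k >= 2.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    n_synth = n if n_synth is None else n_synth

    # Brute-force pairwise squared Euclidean distances (fine for small samples).
    diffs = X[:, None, :] - X[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diffs, diffs)

    # Indices of the k nearest rows for every row (each row counts itself).
    nn_idx = np.argsort(dist2, axis=1)[:, :k]

    synth = np.empty((n_synth, d))
    anchors = rng.integers(0, n, size=n_synth)  # random anchor observations
    for s, i in enumerate(anchors):
        nbrs = X[nn_idx[i]]
        mu = nbrs.mean(axis=0)                           # local mean
        cov = np.atleast_2d(np.cov(nbrs, rowvar=False))  # local covariance
        synth[s] = rng.multivariate_normal(mu, cov)
    return synth


if __name__ == "__main__":
    data = np.random.default_rng(1).normal(size=(200, 2))
    synthetic = local_resample_sketch(data, k=15)
    print(synthetic.shape)
```

In this sketch no original row is copied verbatim, and an isolated point contributes only as one of `k` points to any local estimate. Here `k` acts as the tuning knob: larger neighborhoods smooth more aggressively, smaller ones track the data more closely. How the paper itself parameterizes the privacy-utility balance is not stated in the abstract.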
