Paper Title
NeuroMixGDP: A Neural Collapse-Inspired Random Mixup for Private Data Release
Paper Authors
Paper Abstract
Privacy-preserving data release algorithms have gained increasing attention for their ability to protect user privacy while enabling downstream machine learning tasks. However, the utility of current popular algorithms is not always satisfactory. Mixup of raw data provides a new way of data augmentation, which can help improve utility. However, its performance deteriorates drastically when differential privacy (DP) noise is added. To address this issue, this paper draws inspiration from the recently observed Neural Collapse (NC) phenomenon, which states that the last-layer features of a neural network concentrate on the vertices of a simplex Equiangular Tight Frame (ETF). We propose a scheme that mixes up the Neural Collapse features to exploit the ETF simplex structure and releases the noisy mixed features to enhance the utility of the released data. Using Gaussian Differential Privacy (GDP), we obtain an asymptotic rate for the optimal mixup degree. To further enhance utility and address the label collapse issue that arises when the mixup degree is large, we propose a hierarchical sampling method that stratifies the mixup samples over a small number of classes. This method remarkably improves utility when the number of classes is large. Extensive experiments demonstrate the effectiveness of the proposed method in protecting against attacks and improving utility. In particular, our approach shows significantly improved utility compared to directly training classification networks with DPSGD on the CIFAR100 and MiniImagenet datasets, highlighting the benefits of privacy-preserving data release. We release reproducible code at https://github.com/Lidonghao1996/NeuroMixGDP.
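To make the release mechanism described in the abstract concrete, below is a minimal, illustrative Python sketch of the core step: averaging randomly chosen last-layer features (and their one-hot labels) at mixup degree m, then adding Gaussian noise before release. All names (`release_noisy_mixup`, `sigma`, the unit norm-clipping assumption) are hypothetical and not taken from the paper's codebase; the actual NeuroMixGDP implementation and GDP noise calibration are in the linked repository.

```python
import numpy as np

def release_noisy_mixup(features, labels, n_release, m, sigma, rng=None):
    """Illustrative sketch (not the authors' implementation).

    features: (n, d) penultimate-layer features, assumed norm-clipped to 1.
    labels:   (n, k) one-hot label matrix.
    n_release: number of noisy mixed samples to release.
    m:         mixup degree (number of samples averaged per release).
    sigma:     Gaussian noise scale, assumed calibrated via GDP to the
               sensitivity of the mixup average (at most 1/m under unit
               clipping when one record is added or removed).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = features.shape[0]
    mixed_x = np.empty((n_release, features.shape[1]))
    mixed_y = np.empty((n_release, labels.shape[1]))
    for i in range(n_release):
        # Random mixup: average m distinct samples and their labels.
        idx = rng.choice(n, size=m, replace=False)
        mixed_x[i] = features[idx].mean(axis=0)
        mixed_y[i] = labels[idx].mean(axis=0)
    # Gaussian mechanism on both features and soft labels before release.
    mixed_x += rng.normal(scale=sigma, size=mixed_x.shape)
    mixed_y += rng.normal(scale=sigma, size=mixed_y.shape)
    return mixed_x, mixed_y
```

The hierarchical sampling variant mentioned in the abstract would restrict each draw `idx` to a small subset of classes, mitigating label collapse at large m; that refinement is omitted from this sketch.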