Paper Title
Latent Dirichlet Allocation Model Training with Differential Privacy
Paper Authors
Paper Abstract
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for hidden semantic discovery of text data and serves as a fundamental tool for text analysis in various applications. However, the LDA model as well as the training process of LDA may expose the text information in the training data, thus bringing significant privacy concerns. To address the privacy issue in LDA, we systematically investigate the privacy protection of the mainstream LDA training algorithm based on Collapsed Gibbs Sampling (CGS) and propose several differentially private LDA algorithms for typical training scenarios. In particular, we present the first theoretical analysis on the inherent differential privacy guarantee of CGS-based LDA training and further propose a centralized privacy-preserving algorithm (HDP-LDA) that can prevent data inference from the intermediate statistics in the CGS training. Also, we propose a locally private LDA training algorithm (LP-LDA) on crowdsourced data to provide local differential privacy for individual data contributors. Furthermore, we extend LP-LDA to an online version, OLP-LDA, to achieve LDA training on locally private mini-batches in a streaming setting. Extensive analysis and experimental results validate both the effectiveness and efficiency of our proposed privacy-preserving LDA training algorithms.
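The abstract centers on Collapsed Gibbs Sampling, whose per-token count updates are the intermediate statistics the proposed algorithms protect. As context, here is a minimal, illustrative sketch of plain (non-private) CGS for LDA; the corpus format, variable names, and hyperparameter values are assumptions, not the paper's implementation:

```python
# Hedged sketch: Collapsed Gibbs Sampling (CGS) for LDA training.
# The count tables n_dk, n_kw, n_k are exactly the kind of intermediate
# statistics that, per the abstract, can leak training-data information.
import random

def cgs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=50, seed=0):
    rng = random.Random(seed)
    # random initial topic assignment for every token
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    n_dk = [[0] * K for _ in docs]        # document-topic counts
    n_kw = [[0] * V for _ in range(K)]    # topic-word counts
    n_k = [0] * K                         # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the token, resample its topic from the
                # collapsed conditional, then add it back
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                           / (n_k[t] + V * beta) for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    return n_kw  # released topic-word counts: a target for inference attacks

# toy corpus: lists of word ids over a vocabulary of size V=4
topics = cgs_lda([[0, 0, 1], [2, 3, 3]], V=4, K=2)
```

The paper's HDP-LDA and LP-LDA variants would perturb or locally randomize these statistics before release; this sketch only shows the baseline sampler they build on.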