Paper Title
The Limits of Word Level Differential Privacy
Paper Authors
Paper Abstract
As privacy and trust receive increasing attention within the research community, various attempts have been made to anonymize textual data. A significant subset of these approaches incorporates differentially private mechanisms to perturb word embeddings, thereby replacing individual words in a sentence. While these methods represent very important contributions, have various advantages over other techniques, and do show anonymization capabilities, they have several shortcomings. In this paper, we investigate these weaknesses and demonstrate significant mathematical constraints that diminish the theoretical privacy guarantee, as well as major practical shortcomings with regard to protection against deanonymization attacks, preservation of the content of the original sentences, and the quality of the language output. Finally, we propose a new method for text anonymization, based on transformer-based language models fine-tuned for paraphrasing, that circumvents most of the identified weaknesses and also offers a formal privacy guarantee. We evaluate the performance of our method via thorough experimentation and demonstrate superior performance over the discussed mechanisms.
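To make the class of mechanisms discussed in the abstract concrete, the following is a minimal sketch of word-level perturbation: noise is added to a word's embedding and the word is replaced by the vocabulary entry nearest to the noisy vector. It is an illustration under stated assumptions, not the implementation evaluated in the paper. The toy vocabulary `EMBEDDINGS`, the function names, and the choice of noise with density proportional to exp(-epsilon * ||z||) are all assumptions made here for the example.

```python
import math
import random

# Hypothetical 2-dimensional toy embeddings, chosen only for illustration.
EMBEDDINGS = {
    "good": (1.0, 0.9),
    "great": (1.1, 1.0),
    "bad": (-1.0, -0.9),
    "awful": (-1.1, -1.0),
    "movie": (0.0, 2.0),
    "film": (0.1, 2.1),
}


def sample_noise(dim, epsilon, rng):
    """Sample a vector z with density proportional to exp(-epsilon * ||z||).

    The direction is uniform on the unit sphere (a normalized Gaussian
    vector) and the norm follows a Gamma(shape=dim, scale=1/epsilon) law.
    """
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction))
    radius = rng.gammavariate(dim, 1.0 / epsilon)
    return [radius * x / norm for x in direction]


def nearest_word(vector):
    """Return the vocabulary word whose embedding is closest to `vector`."""
    return min(EMBEDDINGS, key=lambda w: math.dist(EMBEDDINGS[w], vector))


def perturb_word(word, epsilon, rng):
    """Add noise to the embedding of `word` and map back to the vocabulary.

    Assumes `word` is in the toy vocabulary; a KeyError is raised otherwise.
    """
    vector = EMBEDDINGS[word]
    noise = sample_noise(len(vector), epsilon, rng)
    noisy = [v + n for v, n in zip(vector, noise)]
    return nearest_word(noisy)


def perturb_sentence(words, epsilon, rng):
    """Perturb each word independently, one word at a time."""
    return [perturb_word(w, epsilon, rng) for w in words]


if __name__ == "__main__":
    rng = random.Random(0)
    for eps in (0.5, 5.0, 50.0):
        print(eps, perturb_sentence(["good", "movie"], eps, rng))
```

In this sketch a smaller `epsilon` yields larger noise and therefore more frequent word replacements, while a very large `epsilon` leaves the sentence essentially unchanged. Because each word is handled independently, sentence length and word order carry over to the output unchanged.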