Paper Title
An Easy-to-use and Robust Approach for the Differentially Private De-Identification of Clinical Textual Documents
Authors
Abstract
Unstructured textual data is at the heart of healthcare systems. For obvious privacy reasons, these documents are not accessible to researchers as long as they contain personally identifiable information. One way to share this data while respecting the legislative framework (notably GDPR or HIPAA) is to de-identify it within the medical structures, i.e. to detect a person's personal information with a Named Entity Recognition (NER) system and then replace it so that it becomes very difficult to associate the document with that person. The challenge is to have reliable NER and substitution tools without compromising confidentiality and consistency in the document. Most existing research focuses on English medical documents and uses coarse substitutions that do not benefit from advances in privacy. This paper shows how an efficient and differentially private de-identification approach can be achieved by strengthening the less robust de-identification method and by adapting state-of-the-art differentially private mechanisms for substitution purposes. The result is an approach for de-identifying clinical documents in French that can also be generalized to other languages and whose robustness is mathematically proven.
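To make the detect-then-substitute idea concrete, the following is a minimal sketch and not the paper's actual method. It assumes that entity spans have already been produced by some NER system, and it uses the standard exponential mechanism as a stand-in for the differentially private substitution step, since the abstract does not name the mechanism used. The function names, the surrogate pools, and the first-letter utility are all hypothetical and chosen only for illustration.

```python
import math
import random


def selection_probabilities(scores, epsilon, sensitivity=1.0):
    """Exponential-mechanism probabilities: p_i proportional to
    exp(epsilon * score_i / (2 * sensitivity))."""
    top = max(scores)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def first_letter_utility(original, candidate):
    """Hypothetical utility bounded in [0, 1] (so sensitivity is at most 1):
    prefer surrogates that share the original's first letter."""
    return 1.0 if original[:1].lower() == candidate[:1].lower() else 0.0


def substitute(text, entities, surrogates, epsilon, rng=random):
    """Replace each (start, end, label) span with a randomly sampled surrogate.

    `entities` is assumed to come from an upstream NER system;
    `surrogates` maps each label to a pool of replacement strings.
    """
    out = text
    # Work from right to left so that earlier offsets stay valid.
    for start, end, label in sorted(entities, reverse=True):
        original = out[start:end]
        pool = surrogates[label]
        scores = [first_letter_utility(original, c) for c in pool]
        probs = selection_probabilities(scores, epsilon)
        replacement = rng.choices(pool, weights=probs, k=1)[0]
        out = out[:start] + replacement + out[end:]
    return out


# Illustrative usage with made-up data (spans written by hand, not by a real NER model).
text = "Patient Jean Dupont was seen in Lyon."
entities = [(8, 19, "NAME"), (32, 36, "CITY")]
surrogates = {
    "NAME": ["Marie Martin", "Paul Bernard", "Jules Durand"],
    "CITY": ["Lille", "Nantes", "Lyon"],
}
print(substitute(text, entities, surrogates, epsilon=1.0))
```

A smaller `epsilon` pushes the sampling toward uniform (more privacy, less resemblance to the original), while a larger one favors high-utility surrogates. This sketch only illustrates per-entity sampling and makes no claim about the end-to-end guarantee proven in the paper.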