Paper title
Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records
Paper authors
Paper abstract
Unstructured information in electronic health records provides an invaluable resource for medical research. To protect patient confidentiality and to conform to privacy regulations, de-identification methods automatically remove personally identifying information from these medical records. However, due to the unavailability of labeled data, most existing research is constrained to English medical text, and little is known about the generalizability of de-identification methods across languages and domains. In this study, we construct a varied dataset consisting of the medical records of 1260 patients by sampling data from nine institutes and three domains of Dutch healthcare. We test the generalizability of three de-identification methods across languages and domains. Our experiments show that an existing rule-based method specifically developed for the Dutch language fails to generalize to this new data. Furthermore, a state-of-the-art neural architecture performs strongly across languages and domains, even with limited training data. Compared to feature-based and rule-based methods, the neural method requires significantly less configuration effort and domain knowledge. We make all code and pre-trained de-identification models available to the research community, allowing practitioners to apply them to their own datasets and enabling future benchmarks.