Paper Title
Sensitive Data Detection and Classification in Spanish Clinical Text: Experiments with BERT
Paper Authors
Paper Abstract
Massive digital data processing provides a wide range of opportunities and benefits, but at the cost of endangering personal data privacy. Anonymisation consists in removing or replacing sensitive information in data, enabling its exploitation for different purposes while preserving the privacy of individuals. Over the years, many automatic anonymisation systems have been proposed; however, depending on the type of data, the target language, or the availability of training documents, the task remains challenging. The emergence of novel deep-learning models during the last two years has brought large improvements to the state of the art in Natural Language Processing. These advances have been led most notably by BERT, a model proposed by Google in 2018, and by shared language models pre-trained on millions of documents. In this paper, we use a BERT-based sequence-labelling model to conduct a series of anonymisation experiments on several clinical datasets in Spanish, and we compare BERT with other algorithms. The experiments show that a simple BERT-based model with general-domain pre-training obtains highly competitive results without any domain-specific feature engineering.
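To make the abstract's setup concrete, below is a minimal sketch of a BERT-based sequence-labelling (token classification) pipeline for detecting sensitive spans in Spanish clinical text, using the Hugging Face transformers library. The checkpoint name, BIO label set, and example sentence are illustrative assumptions, not the authors' exact configuration, and the classification head here is untrained: it would need fine-tuning on annotated clinical data before producing meaningful tags.

```python
# Sketch of a BERT-based token-classification model for anonymisation.
# Checkpoint, labels, and example text are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO-style tags for sensitive-information categories.
labels = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE", "B-LOCATION", "I-LOCATION"]

# A general-domain multilingual BERT; the paper reports competitive results
# without domain-specific pre-training or feature engineering.
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels)
)

# Tag each subword token of a fictitious Spanish clinical sentence.
# NOTE: the classification head is randomly initialised, so the printed
# tags are arbitrary until the model is fine-tuned on labelled data.
text = "El paciente Juan Pérez ingresó el 3 de mayo en Bilbao."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(f"{token}\t{labels[pred]}")
```

In a full anonymisation system, the predicted spans would then be removed or replaced with surrogates; fine-tuning the model above on a corpus annotated with sensitive-information tags is the standard recipe this sketch assumes.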