Paper Title
Privacy-Preserving Models for Legal Natural Language Processing
Paper Authors
Paper Abstract
Pre-training large transformer models with in-domain data improves domain adaptation and helps gain performance on domain-specific downstream tasks. However, sharing models pre-trained on potentially sensitive data is prone to adversarial privacy attacks. In this paper, we ask to what extent we can guarantee the privacy of pre-training data and, at the same time, achieve better downstream performance on legal tasks without the need for additional labeled data. We extensively experiment with scalable self-supervised learning of transformer models under the formal paradigm of differential privacy and show that under specific training configurations we can improve downstream performance without sacrificing privacy protection for the in-domain data. Our main contribution is utilizing differential privacy for large-scale pre-training of transformer language models in the legal NLP domain, which, to the best of our knowledge, has not been addressed before.
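The formal paradigm of differential privacy referred to in the abstract is, in deep learning, typically realized via DP-SGD (Abadi et al., 2016): per-example gradient clipping followed by calibrated Gaussian noise. The sketch below is a minimal, microbatched PyTorch illustration of that mechanism, not the paper's actual training setup; `model`, `loss_fn`, `batch`, and the hyperparameters are hypothetical placeholders, and production runs would use a vectorized library such as Opacus.

```python
# Minimal DP-SGD sketch (Abadi et al., 2016): clip each example's gradient,
# then add Gaussian noise before the optimizer step. Illustrative only --
# model, loss_fn, batch, and hyperparameters are assumed placeholders.
import torch

def dp_sgd_step(model, loss_fn, batch, optimizer,
                max_grad_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD update on a batch of (inputs, targets)."""
    optimizer.zero_grad()
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Per-example gradients via size-1 microbatches: simple but slow;
    # libraries such as Opacus vectorize this computation.
    inputs, targets = batch
    for x, y in zip(inputs, targets):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (max_grad_norm / (norm + 1e-6)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale  # clipping bounds each example's contribution

    # Noise calibrated to the clipping norm yields the DP guarantee;
    # the (epsilon, delta) budget is then tracked by a privacy accountant.
    batch_size = len(inputs)
    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * noise_multiplier * max_grad_norm
        p.grad = (s + noise) / batch_size
    optimizer.step()
```

Applied to masked-language-model pre-training, `loss_fn` would be the MLM objective and each step would consume a batch of tokenized in-domain (here, legal) text; the clipping norm and noise multiplier jointly determine the privacy budget spent per step.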