Paper Title

Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding

Authors

Sahar Abdelnabi, Mario Fritz

Abstract

Recent advances in natural language generation have introduced powerful language models with high-quality output text. However, this raises concerns about the potential misuse of such models for malicious purposes. In this paper, we study natural language watermarking as a defense to help better mark and trace the provenance of text. We introduce the Adversarial Watermarking Transformer (AWT) with a jointly trained encoder-decoder and adversarial training that, given an input text and a binary message, generates an output text that is unobtrusively encoded with the given message. We further study different training and inference strategies to achieve minimal changes to the semantics and correctness of the input text. AWT is the first end-to-end model to hide data in text by automatically learning -- without ground truth -- word substitutions along with their locations in order to encode the message. We empirically show that our model is effective in largely preserving text utility and decoding the watermark while hiding its presence against adversaries. Additionally, we demonstrate that our method is robust against a range of attacks.
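The abstract describes encoding a binary message into text through word substitutions whose choices and locations AWT learns end-to-end. As a rough, hand-crafted illustration of that underlying idea only (not the paper's learned model), the sketch below encodes bits by picking between fixed synonym pairs; the synonym table and function names are hypothetical, and a real AWT replaces this table with a jointly trained transformer encoder-decoder.

```python
# Toy sketch of bit-encoding via word substitutions (illustrative only).
# AWT learns substitutions and their positions end-to-end; here a fixed,
# hypothetical synonym table makes the encode/decode round trip explicit.

SYNONYMS = {  # canonical key -> (word for bit 0, word for bit 1)
    "big": ("big", "large"),
    "quick": ("quick", "fast"),
    "show": ("show", "demonstrate"),
}
# Reverse lookup: any surface form back to its canonical key.
CANONICAL = {w: key for key, pair in SYNONYMS.items() for w in pair}

def encode(text, bits):
    """Embed `bits` by choosing synonyms at carrier-word positions."""
    out, i = [], 0
    for tok in text.split():
        key = CANONICAL.get(tok.lower())
        if key is not None and i < len(bits):
            out.append(SYNONYMS[key][bits[i]])  # pick the bit's variant
            i += 1
        else:
            out.append(tok)  # non-carrier words pass through unchanged
    if i < len(bits):
        raise ValueError("not enough carrier words for the message")
    return " ".join(out)

def decode(text):
    """Recover the bit sequence from which synonym variant appears."""
    bits = []
    for tok in text.split():
        key = CANONICAL.get(tok.lower())
        if key is not None:
            bits.append(SYNONYMS[key].index(tok.lower()))
    return bits

watermarked = encode("a big dog and a quick cat show tricks", [1, 0, 1])
print(watermarked)          # a large dog and a quick cat demonstrate tricks
print(decode(watermarked))  # [1, 0, 1]
```

Note that such a fixed table is exactly what an adversary could reverse-engineer; AWT's contribution is learning unobtrusive substitutions adversarially so the watermark's presence is hidden.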
