Title

Residue-Based Natural Language Adversarial Attack Detection

Authors

Vyas Raina, Mark Gales

Abstract

Deep learning based systems are susceptible to adversarial attacks, where a small, imperceptible change at the input alters the model prediction. However, to date, the majority of approaches to detect these attacks have been designed for image processing systems. Many popular image adversarial detection approaches are able to identify adversarial examples from embedding feature spaces, whilst in the NLP domain existing state-of-the-art detection approaches solely focus on input text features, without consideration of model embedding spaces. This work examines what differences result when porting these image-designed strategies to Natural Language Processing (NLP) tasks: these detectors are found to not port over well. This is expected, as NLP systems have a very different form of input: discrete and sequential in nature, rather than the continuous, fixed-size inputs of images. As an equivalent model-focused NLP detection approach, this work proposes a simple sentence-embedding "residue" based detector to identify adversarial examples. On many tasks, it outperforms ported image domain detectors and recent state-of-the-art NLP-specific detectors.
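The abstract only names the detector at a high level. A minimal sketch of one plausible reading, assuming "residue" means the component of a sentence embedding left over after projection onto the principal subspace fitted on clean data; the synthetic Gaussian embeddings below stand in for the output of a real sentence encoder, and the subspace rank and threshold are illustrative choices, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for sentence embeddings: clean examples lie mostly in a
# low-dimensional subspace; adversarial examples are pushed off that subspace.
dim, k = 64, 8
basis = rng.normal(size=(dim, k))
clean = rng.normal(size=(200, k)) @ basis.T + 0.01 * rng.normal(size=(200, dim))
adversarial = clean[:50] + 0.5 * rng.normal(size=(50, dim))

# Fit the principal subspace of the clean embeddings with an SVD.
mean = clean.mean(axis=0)
_, _, vt = np.linalg.svd(clean - mean, full_matrices=False)
top = vt[:k]  # top-k principal directions, shape (k, dim)

def residue_norm(x):
    """Norm of the embedding component outside the clean principal subspace."""
    centred = x - mean
    reconstruction = centred @ top.T @ top  # projection onto the subspace
    return np.linalg.norm(centred - reconstruction, axis=-1)

# Flag an input as adversarial when its residue exceeds a threshold
# calibrated on clean data (95th percentile here, an arbitrary choice).
threshold = np.percentile(residue_norm(clean), 95)
flagged = residue_norm(adversarial) > threshold
```

Because the off-subspace perturbation is what distinguishes the adversarial points here, their residue norms are far larger than those of clean points, so nearly all are flagged.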
