Paper Title
Phrase-level Textual Adversarial Attack with Label Preservation
Paper Authors
Paper Abstract
Generating high-quality textual adversarial examples is critical for investigating the pitfalls of natural language processing (NLP) models and further promoting their robustness. Existing attacks are usually realized through word-level or sentence-level perturbations, which either limit the perturbation space or sacrifice fluency and textual quality, both of which affect the attack effectiveness. In this paper, we propose Phrase-Level Textual Adversarial aTtack (PLAT), which generates adversarial samples through phrase-level perturbations. PLAT first extracts vulnerable phrases as attack targets using a syntactic parser, and then perturbs them with a pre-trained blank-infilling model. Such a flexible perturbation design substantially expands the search space for more effective attacks without introducing too many modifications, while maintaining textual fluency and grammaticality via contextualized generation conditioned on the surrounding text. Moreover, we develop a label-preservation filter that leverages the likelihoods of language models fine-tuned on each class, rather than textual similarity, to rule out perturbations that would likely alter the original class label for human readers. Extensive experiments and human evaluation demonstrate that PLAT achieves superior attack effectiveness as well as better label consistency than strong baselines.
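
To make the phrase-level perturbation step concrete, below is a minimal sketch, not the authors' code: it uses spaCy noun chunks as a stand-in for PLAT's parser-extracted vulnerable phrases, and off-the-shelf T5 span infilling as the pre-trained blank-infilling model. The function name `perturb_phrases` and all generation hyperparameters are illustrative assumptions.

```python
# Sketch: mask a candidate phrase and let a blank-infilling model rewrite it in
# context. Assumes `pip install spacy transformers sentencepiece torch` and the
# en_core_web_sm spaCy model; noun chunks stand in for parser-selected phrases.
import spacy
from transformers import T5Tokenizer, T5ForConditionalGeneration

nlp = spacy.load("en_core_web_sm")
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def perturb_phrases(sentence, num_candidates=5):
    """Return candidate sentences with one phrase rewritten by contextual infilling."""
    doc = nlp(sentence)
    candidates = []
    for chunk in doc.noun_chunks:  # stand-in for parser-extracted vulnerable phrases
        # Replace the phrase with T5's span sentinel so the model infills it in context.
        masked = sentence[:chunk.start_char] + "<extra_id_0>" + sentence[chunk.end_char:]
        inputs = tokenizer(masked, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            num_beams=num_candidates,
            num_return_sequences=num_candidates,
            max_new_tokens=16,
        )
        for out in outputs:
            # T5 emits "<extra_id_0> infill <extra_id_1> ..."; keep the first span.
            text = tokenizer.decode(out, skip_special_tokens=False)
            infill = text.split("<extra_id_0>")[-1].split("<extra_id_1>")[0]
            infill = infill.replace("<pad>", "").replace("</s>", "").strip()
            if infill and infill.lower() != chunk.text.lower():
                candidates.append(masked.replace("<extra_id_0>", infill))
    return candidates
```

Because the infilling model conditions on the surrounding text, the rewritten phrase tends to stay fluent and grammatical, which is the property the abstract attributes to contextualized generation.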
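The label-preservation filter can likewise be sketched under stated assumptions: one causal language model (here GPT-2) fine-tuned per class, with a perturbation kept only if the original class's model assigns it the highest likelihood. The checkpoint paths, the `margin` parameter, and the helper names are hypothetical, not taken from the paper.

```python
# Sketch of a likelihood-based label-preservation filter, assuming one GPT-2
# checkpoint fine-tuned per class (paths below are hypothetical placeholders).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def class_log_likelihood(model, tokenizer, text):
    """Average token log-likelihood of `text` under a class-specific LM."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # labels=input_ids yields mean cross-entropy over tokens; negate it.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return -loss.item()

def keeps_label(text, models, tokenizer, orig_label, margin=0.0):
    """Accept the perturbation only if the original class's LM scores it highest."""
    scores = {label: class_log_likelihood(m, tokenizer, text)
              for label, m in models.items()}
    best_other = max(s for label, s in scores.items() if label != orig_label)
    return scores[orig_label] - best_other >= margin

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
models = {  # hypothetical per-class fine-tuned checkpoints
    "positive": GPT2LMHeadModel.from_pretrained("./lm-positive"),
    "negative": GPT2LMHeadModel.from_pretrained("./lm-negative"),
}
```

Filtering on per-class likelihood rather than textual similarity matches the abstract's design choice: a perturbation can be lexically distant from the original yet still read as the same class to humans, and vice versa.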