Paper Title

NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as Artificial Adversaries?

Paper Authors

Saadia Gabriel, Hamid Palangi, Yejin Choi

Abstract

While a substantial body of prior work has explored adversarial example generation for natural language understanding tasks, these examples are often unrealistic and diverge from real-world data distributions. In this work, we introduce a two-stage adversarial example generation framework (NaturalAdversaries) for designing adversaries that are effective at fooling a given classifier and that demonstrate natural-looking failure cases which could plausibly occur during in-the-wild deployment of the models. In the first stage, a token attribution method is used to summarize a given classifier's behaviour as a function of the key tokens in the input. In the second stage, a generative model is conditioned on the key tokens from the first stage. NaturalAdversaries is adaptable to both black-box and white-box adversarial attacks depending on the level of access to the model parameters. Our results indicate that these adversaries generalize across domains and offer insights for future research on improving the robustness of neural text classification models.
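
To make the two-stage pipeline in the abstract concrete, the minimal sketch below illustrates the general idea: a token attribution step ranks the input tokens that drive a classifier's prediction, and a generative language model is then conditioned on those key tokens to produce natural-looking candidate sentences. This is not the authors' implementation; the specific classifier (distilbert-base-uncased-finetuned-sst-2-english), generator (gpt2), gradient-times-input attribution, and keyword-prompt format are all assumptions chosen for brevity.

```python
# Illustrative sketch only, not the paper's implementation.
# Stage 1: gradient-x-input attribution picks key tokens (assumed attribution method).
# Stage 2: a causal LM is conditioned on those tokens via a keyword prompt (assumed format).
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          AutoModelForCausalLM)

clf_name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed classifier
gen_name = "gpt2"                                              # assumed generator

clf_tok = AutoTokenizer.from_pretrained(clf_name)
clf = AutoModelForSequenceClassification.from_pretrained(clf_name)
gen_tok = AutoTokenizer.from_pretrained(gen_name)
gen = AutoModelForCausalLM.from_pretrained(gen_name)


def key_tokens(text, k=3):
    """Stage 1: rank input tokens by a simple gradient-x-input attribution."""
    enc = clf_tok(text, return_tensors="pt")
    embeds = clf.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    logits = clf(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
    logits[0, logits.argmax()].backward()          # gradient w.r.t. predicted class
    scores = (embeds.grad * embeds).sum(-1).abs().squeeze(0)
    tokens = clf_tok.convert_ids_to_tokens(enc["input_ids"][0])
    ranked = sorted(zip(tokens, scores.tolist()), key=lambda x: -x[1])
    return [t for t, _ in ranked if t not in ("[CLS]", "[SEP]")][:k]


def generate_candidate(tokens, max_new_tokens=30):
    """Stage 2: condition a generative LM on the key tokens (keyword prompt)."""
    prompt = "Keywords: " + ", ".join(tokens) + "\nSentence:"
    ids = gen_tok(prompt, return_tensors="pt").input_ids
    out = gen.generate(ids, do_sample=True, top_p=0.9,
                       max_new_tokens=max_new_tokens,
                       pad_token_id=gen_tok.eos_token_id)
    return gen_tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()


def predict(text):
    """Classifier label for a piece of text."""
    with torch.no_grad():
        return clf(**clf_tok(text, return_tensors="pt")).logits.argmax().item()


if __name__ == "__main__":
    source = "The film was a complete waste of time."
    keys = key_tokens(source)
    candidate = generate_candidate(keys)
    # A candidate counts as adversarial here if it reads naturally yet flips the label.
    print(keys, "|", candidate, "|", predict(source), "->", predict(candidate))
```

In a fuller attack loop, many candidates would be sampled and only fluent ones that flip the classifier's prediction kept; the black-box setting mentioned in the abstract would swap the gradient-based attribution for one that does not require access to model parameters.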
