Paper Title
Balanced Adversarial Training: Balancing Tradeoffs between Fickleness and Obstinacy in NLP Models
Paper Authors
Paper Abstract
Traditional (fickle) adversarial examples involve finding a small perturbation that does not change an input's true label but confuses the classifier into outputting a different prediction. Conversely, obstinate adversarial examples occur when an adversary finds a small perturbation that preserves the classifier's prediction but changes the true label of an input. Adversarial training and certified robust training have shown some effectiveness in improving the robustness of machine learning models to fickle adversarial examples. We show that standard adversarial training methods focused on reducing vulnerability to fickle adversarial examples may make a model more vulnerable to obstinate adversarial examples, with experiments on both natural language inference and paraphrase identification tasks. To counter this phenomenon, we introduce Balanced Adversarial Training, which incorporates contrastive learning to increase robustness against both fickle and obstinate adversarial examples.
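The two attack types in the abstract can be stated side by side. The following is a minimal formalization in our own notation (the symbols f, y, and Δ(x) are not taken from the paper): f is the classifier, y gives the true (oracle) label, and Δ(x) is the set of allowed small perturbations of an input x, e.g., bounded word substitutions.

```latex
% Fickle adversarial example x': the prediction flips, the true label does not.
\exists\, x' \in \Delta(x):\quad y(x') = y(x) \;\wedge\; f(x') \neq f(x)

% Obstinate adversarial example x'': the true label flips, the prediction does not.
\exists\, x'' \in \Delta(x):\quad y(x'') \neq y(x) \;\wedge\; f(x'') = f(x)
```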
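To make the training idea concrete, below is an illustrative PyTorch sketch of a loss that combines standard adversarial training on label-preserving (fickle) variants with a contrastive term that pushes label-changing (obstinate) variants away from the clean representation. This is only a sketch of the general technique, not the paper's exact method; the interfaces `model.classify`, `model.encode`, the perturbation functions `make_fickle` and `make_obstinate` (e.g., synonym vs. antonym substitutions), and the temperature `tau` are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def balanced_adv_loss(model, x, y, make_fickle, make_obstinate, tau=0.1):
    """Cross-entropy on clean and fickle inputs, plus an InfoNCE-style
    contrastive term over (clean, fickle, obstinate) representations."""
    x_fickle = make_fickle(x)        # label-preserving perturbation (hypothetical helper)
    x_obstinate = make_obstinate(x)  # label-changing perturbation (hypothetical helper)

    # Standard adversarial training part: fit clean and fickle inputs to y.
    ce = (F.cross_entropy(model.classify(x), y)
          + F.cross_entropy(model.classify(x_fickle), y))

    # Normalized representations for the contrastive term (assumed encoder API).
    z = F.normalize(model.encode(x), dim=-1)
    z_pos = F.normalize(model.encode(x_fickle), dim=-1)    # positive pair
    z_neg = F.normalize(model.encode(x_obstinate), dim=-1) # negative pair

    # Pull the fickle view toward the clean anchor, push the obstinate view away.
    sim_pos = (z * z_pos).sum(-1) / tau
    sim_neg = (z * z_neg).sum(-1) / tau
    contrastive = -torch.log(
        sim_pos.exp() / (sim_pos.exp() + sim_neg.exp())
    ).mean()

    return ce + contrastive
```

The design point the sketch highlights is the asymmetry the paper targets: training only on fickle examples tightens the decision boundary around label-preserving perturbations, while the contrastive negative keeps label-changing perturbations from collapsing onto the same prediction.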