Paper Title
Formulating Robustness Against Unforeseen Attacks
Paper Authors
Paper Abstract
Existing defenses against adversarial examples, such as adversarial training, typically assume that the adversary will conform to a specific or known threat model, such as $\ell_p$ perturbations within a fixed budget. In this paper, we focus on the scenario where the threat model assumed by the defense during training does not match the actual capabilities of the adversary at test time. We ask the question: if the learner trains against a specific "source" threat model, when can we expect robustness to generalize to a stronger, unknown "target" threat model at test time? Our key contribution is to formally define the problem of learning and generalization with an unforeseen adversary, which helps us reason about the increase in adversarial risk relative to the conventional setting of a known adversary. Applying our framework, we derive a generalization bound that relates the generalization gap between the source and target threat models to the variation of the feature extractor, which measures the expected maximum difference between extracted features across a given threat model. Based on this bound, we propose variation regularization (VR), which reduces the variation of the feature extractor across the source threat model during training. We empirically demonstrate that using VR can lead to improved generalization to unforeseen attacks at test time, and that combining VR with perceptual adversarial training (Laidlaw et al., 2021) achieves state-of-the-art robustness on unforeseen attacks. Our code is publicly available at https://github.com/inspire-group/variation-regularization.
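To make the notion of "variation" more concrete, the following is a minimal sketch, not the authors' implementation. It assumes an $\ell_\infty$ source threat model, an $\ell_2$ distance in feature space, and a toy linear feature extractor; all function names and parameters (`feature_extractor`, `estimate_variation`, `regularized_loss`, `lam`) are hypothetical. It approximates the inner maximum by random sampling, which only gives a lower bound on the true maximum; a practical method would more likely search for the worst-case pair with an optimization procedure.

```python
import random


def feature_extractor(x):
    # Hypothetical toy feature extractor (an assumption, not from the paper):
    # a fixed linear map from 2-D inputs to 2-D features.
    weights = [[0.5, -1.0], [2.0, 0.25]]
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def sample_linf_ball(x, eps, rng):
    # Draw a point uniformly from the l_inf ball of radius eps around x.
    return [xi + rng.uniform(-eps, eps) for xi in x]


def l2_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def estimate_variation(inputs, eps, n_samples=32, seed=0):
    # Monte Carlo estimate of the expected maximum feature difference
    # between two perturbed versions of the same input.
    # Random sampling only lower-bounds the true maximum over the ball.
    rng = random.Random(seed)
    total = 0.0
    for x in inputs:
        feats = [
            feature_extractor(sample_linf_ball(x, eps, rng))
            for _ in range(n_samples)
        ]
        total += max(l2_distance(f1, f2) for f1 in feats for f2 in feats)
    return total / len(inputs)


def regularized_loss(task_loss, variation, lam=1.0):
    # Assumed form of a variation-regularized objective:
    # the task loss plus a weighted variation penalty.
    return task_loss + lam * variation


if __name__ == "__main__":
    data = [[0.0, 1.0], [1.0, -1.0], [0.5, 0.5]]
    v = estimate_variation(data, eps=0.1)
    print("estimated variation:", v)
    print("regularized loss:", regularized_loss(0.7, v, lam=0.5))
```

In this sketch, a larger perturbation budget `eps` yields a larger estimated variation, which matches the intuition that a feature extractor varying less across the source threat model leaves less room for features to drift under a stronger target threat model.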