Paper Title
Catastrophic overfitting can be induced with discriminative non-robust features
Paper Authors
Paper Abstract
Adversarial training (AT) is the de facto method for building robust neural networks, but it can be computationally expensive. To mitigate this, fast single-step attacks can be used, but this may lead to catastrophic overfitting (CO). This phenomenon appears when networks gain non-trivial robustness during the first stages of AT, but then reach a breaking point where they become vulnerable in just a few iterations. The mechanisms that lead to this failure mode are still poorly understood. In this work, we study the onset of CO in single-step AT methods through controlled modifications of typical datasets of natural images. In particular, we show that CO can be induced at much smaller $ε$ values than previously observed, simply by injecting images with seemingly innocuous features. These features aid non-robust classification but are not enough to achieve robustness on their own. Through extensive experiments, we analyze this novel phenomenon and discover that the presence of these easy features induces a learning shortcut that leads to CO. Our findings provide new insights into the mechanisms of CO and improve our understanding of the dynamics of AT. The code to reproduce our experiments can be found at https://github.com/gortizji/co_features.
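To make the setting described in the abstract concrete, below is a minimal sketch of single-step (FGSM-based) adversarial training on a dataset where a small, class-dependent "easy" feature has been injected into every image. This is not the authors' code: the injection scheme (class `InjectedFeatureDataset`, one random sign pattern per class scaled by a small `beta`) is an illustrative assumption, and the paper's exact construction may differ; the training loop is the standard single-step AT that the abstract refers to.

```python
# Sketch only: single-step (FGSM) adversarial training on images with an
# injected class-dependent feature. The injection scheme is a hypothetical
# stand-in, not the construction used in the paper.
import torch
import torch.nn.functional as F


class InjectedFeatureDataset(torch.utils.data.Dataset):
    """Wraps an image dataset and adds a fixed per-class pattern of magnitude beta."""

    def __init__(self, base, num_classes, beta=0.05, seed=0):
        self.base = base
        g = torch.Generator().manual_seed(seed)
        c, h, w = base[0][0].shape
        # One random +/-1 pattern per class, scaled to a small magnitude (assumed scheme).
        self.patterns = beta * torch.sign(torch.randn(num_classes, c, h, w, generator=g))

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x, y = self.base[idx]
        x = torch.clamp(x + self.patterns[y], 0.0, 1.0)
        return x, y


def fgsm_adv_train_epoch(model, loader, optimizer, eps, device="cuda"):
    """One epoch of single-step AT: craft FGSM perturbations, then train on them."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)

        # Single-step FGSM attack around the clean input.
        x_adv = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = torch.clamp(x + eps * grad.sign(), 0.0, 1.0).detach()

        # Standard training step on the adversarial examples.
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
```

In this sketch, catastrophic overfitting would manifest as robust accuracy under a multi-step attack collapsing after some epoch, even though accuracy against the single-step attack keeps rising; the abstract's claim is that the injected easy feature makes this collapse appear at smaller `eps` than on the unmodified dataset.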