Paper Title
Getting a-Round Guarantees: Floating-Point Attacks on Certified Robustness
Paper Authors
Paper Abstract
Adversarial examples pose a security risk as they can alter the decisions of a machine learning classifier through slight input perturbations. Certified robustness has been proposed as a mitigation: given an input $\mathbf{x}$, a classifier returns a prediction and a certified radius $R$ with a provable guarantee that no perturbation of $\mathbf{x}$ with norm bounded by $R$ will alter the classifier's prediction. In this work, we show that these guarantees can be invalidated by the rounding errors inherent to floating-point representation. We design a rounding search method that efficiently exploits this vulnerability to find adversarial examples against state-of-the-art certifications in two threat models, which differ in how the norm of the perturbation is computed. We show that the attack can be carried out against linear classifiers that have exact certifiable guarantees and against neural networks that have conservative certifications. In the weak threat model, our experiments demonstrate attack success rates over 50% on random linear classifiers, up to 23% on the MNIST dataset for a linear SVM, and up to 15% for a neural network. In the strong threat model, the success rates are lower but nonzero. The floating-point errors exploited by our attacks range from small to large (e.g., $10^{-13}$ to $10^{3}$), showing that even negligible errors can be systematically exploited to invalidate the guarantees provided by certified robustness. Finally, we propose a formal mitigation approach based on rounded interval arithmetic, and encourage future implementations of robustness certificates to account for the limitations of modern computing architectures in order to provide sound certifiable guarantees.
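For a linear classifier $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top \mathbf{x} + b)$, the exact certified $\ell_2$ radius at $\mathbf{x}$ is $R = |\mathbf{w}^\top \mathbf{x} + b| \,/\, \|\mathbf{w}\|_2$. The sketch below illustrates the rounding-search idea on such a classifier: step to the certified boundary, then jitter coordinates by single ULPs and look for a point whose computed perturbation norm stays within $R$ while its prediction flips. This is a minimal illustration under assumed parameters (random weights, a fixed search budget, one-ULP jitter), not the paper's implementation.

```python
import numpy as np

# Minimal sketch of a rounding search against an exactly certified linear
# classifier (illustrative values; not the paper's implementation).
rng = np.random.default_rng(0)
w = rng.standard_normal(16)
b = rng.standard_normal()
x = rng.standard_normal(16)

margin = w @ x + b
R = abs(margin) / np.linalg.norm(w)     # certified L2 radius, computed in floats

# Step from x toward the decision boundary, then try one-ULP jitters of the
# resulting point, keeping candidates whose *computed* norm is within R.
direction = -np.sign(margin) * w / np.linalg.norm(w)
base = x + R * direction                # lands near the boundary after rounding
for _ in range(10_000):
    cand = np.nextafter(base, base + rng.choice([-1.0, 1.0], size=16))
    if (np.linalg.norm(cand - x) <= R   # the norm check a verifier would perform
            and np.sign(w @ cand + b) != np.sign(margin)):
        print("certificate violated within radius", R)
        break
else:
    print("no violation found in this run")
```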
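The proposed mitigation, rounded interval arithmetic, replaces each floating-point operation in the certificate computation with an interval operation whose endpoints are rounded outward, so the computed interval always contains the exact real-valued result; the certificate then reports a sound lower bound on the radius. Below is a minimal sketch for the linear case, with hypothetical helper names (`i_add`, `i_mul`, `i_dot`, `sound_radius`); a production implementation would rely on hardware directed-rounding modes or a verified interval library rather than one-ULP widening.

```python
import math

# Outward-rounded interval arithmetic: each endpoint is widened by one ULP so
# the interval is guaranteed to contain the exact real-valued result.
# (Illustrative sketch; directed rounding modes would give tighter intervals.)
def i_add(a, b):
    return (math.nextafter(a[0] + b[0], -math.inf),
            math.nextafter(a[1] + b[1], math.inf))

def i_mul(a, b):
    p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (math.nextafter(min(p), -math.inf),
            math.nextafter(max(p), math.inf))

def i_dot(us, vs):
    acc = (0.0, 0.0)
    for u, v in zip(us, vs):
        acc = i_add(acc, i_mul((u, u), (v, v)))
    return acc

def sound_radius(ws, b, xs):
    """Sound lower bound on the certified L2 radius |w.x + b| / ||w||."""
    m_lo, m_hi = i_add(i_dot(ws, xs), (b, b))
    if m_lo <= 0.0 <= m_hi:                # margin interval straddles zero:
        return 0.0                         # certify nothing
    margin_lb = min(abs(m_lo), abs(m_hi))  # lower bound on |w.x + b|
    norm_ub = math.nextafter(math.sqrt(i_dot(ws, ws)[1]), math.inf)
    return math.nextafter(margin_lb / norm_ub, 0.0)  # round the quotient down
```

Reporting `sound_radius` instead of the naively rounded radius trades a marginally smaller certificate for one that, by construction, resists rounding-based attacks of the kind sketched above.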