Paper Title
PiDAn: A Coherence Optimization Approach for Backdoor Attack Detection and Mitigation in Deep Neural Networks
Paper Authors
Paper Abstract
Backdoor attacks pose a new threat to Deep Neural Networks (DNNs): a backdoor is inserted into the network by poisoning the training dataset, causing the network to misclassify inputs that contain the adversary's trigger. The major challenge in defending against these attacks is that only the attacker knows the secret trigger and the target class. The problem is further exacerbated by the recent introduction of "hidden triggers", where the trigger is carefully fused into the input, bypassing detection by human inspection and causing backdoor identification through anomaly detection to fail. To defend against such imperceptible attacks, in this work we systematically analyze how representations, i.e., the sets of neuron activations of a given DNN when the training data are used as inputs, are affected by backdoor attacks. We propose PiDAn, an algorithm based on coherence optimization that purifies the poisoned data. Our analysis shows that the representations of poisoned data and authentic data in the target class are still embedded in different linear subspaces, which implies that they exhibit different coherence with certain latent spaces. Based on this observation, the proposed PiDAn algorithm learns a sample-wise weight vector that maximizes the projected coherence of the weighted samples, and we demonstrate that the learned weight vector has a natural "grouping effect" and is distinguishable between authentic data and poisoned data. This enables the systematic detection and mitigation of backdoor attacks. Through theoretical analysis and experimental results, we demonstrate the effectiveness of PiDAn in defending against backdoor attacks that use different settings of poisoned samples on the GTSRB and ILSVRC2012 datasets. Our PiDAn algorithm can detect more than 90% of infected classes and identify 95% of poisoned samples.
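
To make the coherence-maximization idea concrete, below is a minimal NumPy sketch. It is not the paper's actual formulation: as a stand-in for PiDAn's projected-coherence objective it uses the dominant eigenvector of the sample Gram matrix (which maximizes w^T X X^T w subject to ||w|| = 1), and the function names `pidan_style_weights` and `flag_poisoned`, as well as the z-score detection rule, are illustrative assumptions rather than the published algorithm.

```python
import numpy as np

def pidan_style_weights(representations):
    """Illustrative sketch: learn a sample-wise weight vector whose
    entries are similar for samples lying near the same linear subspace
    (the "grouping effect" described in the abstract).

    `representations` is an (n_samples, n_features) array of neuron
    activations for one candidate target class. As a stand-in for the
    paper's coherence objective, we take the dominant eigenvector of
    the sample Gram matrix, which maximizes w^T X X^T w s.t. ||w|| = 1.
    """
    X = representations - representations.mean(axis=0)  # center activations
    gram = X @ X.T                                      # pairwise-coherence proxy
    eigvals, eigvecs = np.linalg.eigh(gram)             # eigenvalues in ascending order
    w = eigvecs[:, -1]                                  # top eigenvector
    return np.abs(w)                                    # eigenvector sign is arbitrary

def flag_poisoned(weights, z_thresh=2.0):
    """Flag samples whose weight deviates strongly from the rest.

    A simple z-score rule, assumed here for illustration; the paper's
    actual detection criterion may differ. Returns a boolean mask of
    suspected poisoned samples.
    """
    z = (weights - weights.mean()) / (weights.std() + 1e-12)
    return np.abs(z) > z_thresh

# Toy usage: authentic samples are generic, while poisoned samples
# share a large common offset, i.e., lie near a distinct subspace.
rng = np.random.default_rng(0)
authentic = rng.normal(size=(95, 64))
poisoned = rng.normal(size=(5, 64)) + 4.0 * rng.normal(size=(1, 64))
acts = np.vstack([authentic, poisoned])
mask = flag_poisoned(pidan_style_weights(acts))
print("suspected poisoned indices:", np.where(mask)[0])  # expect 95..99
```

In this toy setup, the shared offset places the five poisoned samples near a common direction, so the learned weights cluster by group and the outlying weights identify them, mirroring the "grouping effect" the abstract attributes to PiDAn's weight vector.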