Paper Title
Fool SHAP with Stealthily Biased Sampling
Paper Authors
Paper Abstract
SHAP explanations aim at identifying which features contribute the most to the difference between a model's prediction at a specific input and its expected prediction over a background distribution. Recent studies have shown that they can be manipulated by malicious adversaries to produce arbitrary desired explanations. However, existing attacks focus solely on altering the black-box model itself. In this paper, we propose a complementary family of attacks that leave the model intact and manipulate SHAP explanations through stealthily biased sampling of the data points used to approximate expectations w.r.t. the background distribution. In the context of a fairness audit, we show that our attack can reduce the importance of a sensitive feature when explaining the difference in outcomes between groups, while remaining undetected. More precisely, experiments performed on real-world datasets show that our attack can yield up to a 90\% relative decrease in the amplitude of the sensitive-feature attribution. These results highlight the manipulability of SHAP explanations and encourage auditors to treat them with skepticism.
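To illustrate the core mechanism, here is a minimal sketch (not the paper's actual algorithm, which selects background weights via an optimization, and not using a real fairness dataset): for a linear model, exact SHAP values reduce to `w * (x - mean(background))`, so subsampling the background to points whose sensitive feature matches the input shrinks that feature's attribution without touching the model. All names and data below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model with weights w; for linear models the exact SHAP value of
# feature j is w[j] * (x[j] - E_background[x[j]]).
w = np.array([2.0, 1.0, 0.5])  # column 0 plays the role of the sensitive feature


def shap_linear(x, background):
    """Exact SHAP attributions of a linear model w.r.t. a background sample."""
    return w * (x - background.mean(axis=0))


# Honest background sample and an input whose sensitive feature differs from it.
background = rng.normal(0.0, 1.0, size=(1000, 3))
x = np.array([1.5, 0.2, -0.3])

honest = shap_linear(x, background)

# Stealthily biased sampling: keep only background points whose sensitive
# feature (column 0) lies close to x[0]. The model is untouched, yet the
# sensitive feature's attribution amplitude collapses.
close = np.abs(background[:, 0] - x[0]) < 0.5
biased = shap_linear(x, background[close])

print(abs(honest[0]), abs(biased[0]))  # biased amplitude is far smaller
```

The real attack must also keep the biased background statistically close to the honest one so a detection test does not flag it; this sketch omits that constraint and only demonstrates why biasing the background reweights the attributions.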