Paper Title
Inferring Sensitive Attributes from Model Explanations
Paper Authors
Paper Abstract
Model explanations provide transparency into a trained machine learning model's black-box behavior to a model builder. They indicate the influence of different input attributes on the corresponding model prediction. The dependency of explanations on the input raises privacy concerns for sensitive user data. However, the current literature offers limited discussion of the privacy risks of model explanations. We focus on the specific privacy risk of attribute inference attacks, wherein an adversary infers the sensitive attributes of an input (e.g., race and sex) given its model explanations. We design the first attribute inference attacks against model explanations under two threat models, where the model builder either (a) includes the sensitive attributes in the training data and input, or (b) censors the sensitive attributes by excluding them from the training data and input. We evaluate the proposed attacks on four benchmark datasets and four state-of-the-art explanation algorithms. We show that an adversary can accurately infer the value of sensitive attributes from explanations in both threat models. Moreover, the attack succeeds even when exploiting only the explanations corresponding to the sensitive attributes. These results suggest that our attacks are effective against explanations and pose a practical threat to data privacy. Combining model predictions (an attack surface exploited by prior attacks) with explanations does not improve attack success. Furthermore, attacks exploiting model explanations are more successful than attacks exploiting only model predictions. These results suggest that model explanations are a strong attack surface for an adversary to exploit.
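To make the described attack surface concrete, the sketch below trains a small attack classifier that maps explanation vectors (per-attribute attribution scores for each input) to a sensitive attribute value. This is a minimal illustration on synthetic stand-in data, not the authors' implementation: the choice of an MLP attack model, the synthetic explanation vectors, and the planted correlation are all assumptions; in the paper the explanations would come from the evaluated explanation algorithms and the labels from an auxiliary dataset known to the adversary.

```python
# Hypothetical attribute inference attack sketch: explanation vectors in,
# sensitive attribute out. Uses synthetic data so it runs standalone.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-in data: each row is an explanation vector (e.g., attribution scores
# an explanation method assigns to the input attributes of one record);
# the label is the sensitive attribute (e.g., a binary sex/race flag) that
# the adversary knows for an auxiliary dataset. A correlation with the first
# attribution is planted so the attack has signal to learn from.
n_samples, n_attributes = 2000, 12
explanations = rng.normal(size=(n_samples, n_attributes))
sensitive_attr = (explanations[:, 0] + 0.5 * rng.normal(size=n_samples) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    explanations, sensitive_attr, test_size=0.3, random_state=0
)

# Attack model: a small classifier trained to predict the sensitive
# attribute from the explanation vector alone.
attack_model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
attack_model.fit(X_train, y_train)

print("attack accuracy:", accuracy_score(y_test, attack_model.predict(X_test)))
```

The same setup can be restricted to the columns corresponding to the sensitive attributes only, or extended by concatenating model prediction vectors to the explanation vectors, mirroring the comparisons discussed in the abstract.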