Radioactive data: tracing through training

Paper title

Radioactive data: tracing through training

Authors

Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Abstract

We want to detect whether a particular image dataset has been used to train a model. We propose a new technique, radioactive data, that makes imperceptible changes to this dataset such that any model trained on it will bear an identifiable mark. The mark is robust to strong variations such as different architectures or optimization methods. Given a trained model, our technique detects the use of radioactive data and provides a level of confidence (p-value). Our experiments on large-scale benchmarks (ImageNet), using standard architectures (ResNet-18, VGG-16, DenseNet-121) and training procedures, show that we can detect usage of radioactive data with high confidence (p < 10^-4) even when only 1% of the data used to train our model is radioactive. Our method is robust to data augmentation and the stochasticity of deep network optimization. As a result, it offers a much higher signal-to-noise ratio than data poisoning and backdoor methods.
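The abstract does not spell out how the confidence level is computed, so the following is only an illustrative sketch of how a p-value for "this model carries the mark" could be obtained. It assumes (this is not stated above) that detection compares a secret mark direction with a vector taken from the trained model, and that under the null hypothesis the two are unrelated, so their cosine similarity times sqrt(dim) is approximately standard normal in high dimension. The names `cosine`, `approx_p_value`, `mark`, `clean`, and `marked` are hypothetical, and the vectors are synthetic.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def approx_p_value(cos_sim, dim):
    """One-sided p-value for observing a cosine at least this large.

    Assumption: under the null hypothesis the model vector is unrelated
    to the mark, so cos_sim * sqrt(dim) is roughly N(0, 1) for large dim.
    This is a normal approximation, not an exact test.
    """
    z = cos_sim * math.sqrt(dim)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


rng = random.Random(0)
dim = 512

# Hypothetical secret direction used to mark the data.
mark = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Synthetic stand-in for a model vector that never saw marked data.
clean = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Synthetic stand-in for a model vector that drifted toward the mark.
marked = [c + 0.5 * m for c, m in zip(clean, mark)]

p_clean = approx_p_value(cosine(mark, clean), dim)
p_marked = approx_p_value(cosine(mark, marked), dim)

print(f"p-value, unmarked model vector: {p_clean:.3g}")
print(f"p-value, marked model vector:   {p_marked:.3g}")
```

A small p-value means the observed alignment with the mark would be very unlikely for a model that had not been trained on the marked data; a threshold such as the p < 10^-4 quoted in the abstract would then be applied to it.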
