Paper Title

When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning?

Paper Authors

Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, Kunal Talwar

Paper Abstract

Modern machine learning models are complex and frequently encode surprising amounts of information about individual inputs. In extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). In this paper, we aim to understand whether this sort of memorization is necessary for accurate learning. We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when the examples are high-dimensional and have entropy much higher than the sample size, and even when most of that information is ultimately irrelevant to the task at hand. Further, our results do not depend on the training algorithm or the class of models used for learning. Our problems are simple and fairly natural variants of the next-symbol prediction and the cluster labeling tasks. These tasks can be seen as abstractions of text- and image-related prediction problems. To establish our results, we reduce from a family of one-way communication problems for which we prove new information complexity lower bounds. Additionally, we present synthetic-data experiments demonstrating successful attacks on logistic regression and neural network classifiers.
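As a rough formalization (a schematic reading of the abstract, not a quoted theorem), write $X = (X_1, \dots, X_n)$ for the training sample, where each example carries roughly $d$ bits of entropy, and $M = \mathcal{A}(X)$ for the model output by a training algorithm $\mathcal{A}$. The memorization claim then takes the shape of a mutual-information lower bound:

```latex
% Schematic form only: constants, accuracy thresholds, and the exact
% problem conditions are elided here; see the paper for precise theorems.
% Informally: if the algorithm A is sufficiently accurate on the task,
% the model must retain essentially all d bits of a constant fraction of
% the n training examples, even though most of those bits are irrelevant
% to the prediction task and nd far exceeds the sample size n.
\[
  \mathrm{err}\bigl(\mathcal{A}(X)\bigr) \le \varepsilon
  \quad\Longrightarrow\quad
  I\bigl(\mathcal{A}(X);\, X\bigr) = \Omega(n d).
\]
```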

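The paper's experiments attack logistic regression and neural-network classifiers trained on synthetic data. Below is a minimal sketch in that spirit (an illustrative setup, not the paper's actual experiment): logistic regression is fit to random labels on high-entropy inputs, so the labels carry no learnable signal, and any above-chance training accuracy can only come from the weights encoding individual training examples.

```python
# Minimal memorization demo (illustrative; not the paper's experiment).
# With far more parameters than examples, logistic regression can fit
# labels that are independent of the inputs. Predicting those labels on
# the training set then requires storing the examples in the weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 20, 500                                   # n examples << d dimensions
X_train = rng.choice([-1.0, 1.0], size=(n, d))   # high-entropy random inputs
y_train = rng.integers(0, 2, size=n)             # labels independent of inputs

# Weak regularization (large C) lets the model interpolate the training set.
model = LogisticRegression(C=1e6, max_iter=10_000).fit(X_train, y_train)

X_fresh = rng.choice([-1.0, 1.0], size=(n, d))   # fresh inputs, same distribution
y_fresh = rng.integers(0, 2, size=n)

print(f"training accuracy: {model.score(X_train, y_train):.2f}")  # ~1.00
print(f"fresh accuracy:    {model.score(X_fresh, y_fresh):.2f}")  # ~0.50, chance level
```

The gap between the two accuracies is the memorization signal: since the labels are pure noise, the only way to predict them correctly on the training set is to carry information about the individual training inputs inside the model.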