Paper Title

The Curious Case of Benign Memorization

Authors

Sotiris Anagnostidis, Gregor Bachmann, Lorenzo Noci, Thomas Hofmann

Abstract

Despite the empirical advances of deep learning across a variety of learning tasks, our theoretical understanding of its success is still very restricted. One of the key challenges is the overparametrized nature of modern models, enabling complete overfitting of the data even if the labels are randomized, i.e. networks can completely \textit{memorize} all given patterns. While such a memorization capacity seems worrisome, in this work we show that under training protocols that include \textit{data augmentation}, neural networks learn to memorize entirely random labels in a benign way, i.e. they learn embeddings that lead to highly non-trivial performance under nearest neighbour probing. We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the task of memorization and feature learning to different layers. As a result, only the very last layers are used for memorization, while preceding layers encode performant features which remain largely unaffected by the label noise. We explore the intricate role of the augmentations used for training and identify a memorization-generalization trade-off in terms of their diversity, marking a clear distinction to all previous works. Finally, we give a first explanation for the emergence of benign memorization by showing that \textit{malign} memorization under data augmentation is infeasible due to the insufficient capacity of the model for the increased sample size. As a consequence, the network is forced to leverage the correlated nature of the augmentations and as a result learns meaningful features. To complete the picture, a better theory of feature learning in deep neural networks is required to fully understand the origins of this phenomenon.
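The abstract evaluates learned embeddings by nearest-neighbour probing: a test point is classified by a majority vote over the labels of its nearest training embeddings. A minimal sketch of such a probe in plain NumPy is below; the function name `knn_probe` and the toy two-cluster data are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def knn_probe(train_emb, train_labels, test_emb, k=5):
    """Classify each test embedding by majority vote over its k nearest
    training embeddings (Euclidean distance)."""
    # pairwise distances: (n_test, n_train)
    d = np.linalg.norm(test_emb[:, None, :] - train_emb[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest train points
    votes = train_labels[nn]               # their labels, shape (n_test, k)
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy stand-in for "performant features": two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
tr = np.concatenate([rng.normal(0.0, 0.1, (50, 8)), rng.normal(3.0, 0.1, (50, 8))])
yr = np.array([0] * 50 + [1] * 50)
te = np.concatenate([rng.normal(0.0, 0.1, (10, 8)), rng.normal(3.0, 0.1, (10, 8))])
pred = knn_probe(tr, yr, te)
acc = (pred == np.array([0] * 10 + [1] * 10)).mean()
```

In the paper's setting, `train_emb` would come from an intermediate layer of a network trained on fully random labels; high probe accuracy despite random training labels is what the authors call benign memorization.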
