Paper Title
Feature Learning in Infinite-Width Neural Networks
Paper Authors
Paper Abstract
As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g. given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT. We propose simple modifications to the standard parametrization to allow for feature learning in the limit. Using the *Tensor Programs* technique, we derive explicit formulas for such limits. On Word2Vec and few-shot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, we compute these limits exactly. We find that they outperform both NTK baselines and finite-width networks, with the latter approaching the infinite-width feature learning performance as width increases. More generally, we classify a natural space of neural network parametrizations that generalizes standard, NTK, and Mean Field parametrizations. We show 1) any parametrization in this space either admits feature learning or has an infinite-width training dynamics given by kernel gradient descent, but not both; 2) any such infinite-width limit can be computed using the Tensor Programs technique. Code for our experiments can be found at github.com/edwardjhu/TP4.
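To make the notion of "parametrization" in the abstract concrete, below is a minimal sketch (not taken from the paper's TP4 code release; variable names such as `forward_std` and `forward_ntk` are illustrative) of the same width-n linear layer written in the standard parametrization versus the NTK parametrization. The two agree in distribution at initialization but place the 1/sqrt(n) factor in different spots, so their gradients, and hence their infinite-width training dynamics, scale differently with width.

```python
import torch
import torch.nn as nn

n = 1024  # hidden width

# Standard parametrization: the width scaling is absorbed into the
# initialization variance (weights ~ N(0, 1/n)); no forward multiplier.
W_std = nn.Parameter(torch.randn(n, n) / n**0.5)
def forward_std(x):            # x: (batch, n)
    return x @ W_std.T

# NTK parametrization: weights start with unit variance (~ N(0, 1)) and the
# layer output is explicitly multiplied by 1/sqrt(n).
W_ntk = nn.Parameter(torch.randn(n, n))
def forward_ntk(x):
    return x @ W_ntk.T / n**0.5
```

At initialization both layers compute outputs with the same distribution, but the gradient with respect to `W_ntk` carries an extra 1/sqrt(n) factor compared to `W_std`, so with a fixed learning rate the weights (and the learned features) move by different amounts as the width grows. The paper's proposed modification to the standard parametrization adjusts such per-layer scale factors (and learning rates) so that feature learning survives in the infinite-width limit; the exact scalings are specified in the paper rather than in this sketch.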