Paper Title

Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

Paper Authors

Lénaïc Chizat, Francis Bach

Paper Abstract

Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In the presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such adaptivity. Our analysis of training is non-quantitative in terms of running time, but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.
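
The central characterization can be stated schematically as follows; this is a sketch in our own notation, assuming positively 2-homogeneous units (e.g., ReLU units with both layers trained), with the precise conditions given in the paper. Writing the infinite-width network as f_\mu(x) = \int \phi(w, x)\, d\mu(w) for a measure \mu over unit weights, and \|f\|_{\mathcal{F}_1} for the associated variation norm, the limit direction of the gradient-flow predictor solves

\max_{\|f\|_{\mathcal{F}_1} \le 1} \; \min_{1 \le i \le n} \; y_i f(x_i),

i.e., it is a max-margin classifier in the non-Hilbertian space \mathcal{F}_1; training only the output layer instead replaces \|\cdot\|_{\mathcal{F}_1} by an RKHS norm, which yields the kernel support vector machine mentioned in the abstract.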
