Paper Title
Large-width asymptotics for ReLU neural networks with $α$-Stable initializations
Paper Authors
Paper Abstract
There is a recent and growing literature on large-width asymptotic properties of Gaussian neural networks (NNs), namely NNs whose weights are initialized as Gaussian distributions. Two popular problems are: i) the study of the large-width distributions of NNs, which characterizes the infinitely wide limit of a rescaled NN in terms of a Gaussian stochastic process; ii) the study of the large-width training dynamics of NNs, which characterizes the infinitely wide dynamics in terms of a deterministic kernel, referred to as the neural tangent kernel (NTK), and shows that, for a sufficiently large width, gradient descent achieves zero training error at a linear rate. In this paper, we consider these problems for $α$-Stable NNs, namely NNs whose weights are initialized as $α$-Stable distributions with $α \in (0,2]$. First, for $α$-Stable NNs with a ReLU activation function, we show that if the NN's width goes to infinity then a rescaled NN converges weakly to an $α$-Stable stochastic process. In contrast to the Gaussian setting, our result shows that the choice of the activation function affects the scaling of the NN, that is: to achieve the infinitely wide $α$-Stable process, the ReLU activation requires an additional logarithmic term in the scaling with respect to sub-linear activations. Then, we study the large-width training dynamics of $α$-Stable ReLU-NNs, characterizing the infinitely wide dynamics in terms of a random kernel, referred to as the $α$-Stable NTK, and showing that, for a sufficiently large width, gradient descent achieves zero training error at a linear rate. The randomness of the $α$-Stable NTK is a further difference with respect to the Gaussian setting, that is: within the $α$-Stable setting, the randomness of the NN at initialization does not vanish in the large-width regime of the training.
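To make the initialization and scaling discussed in the abstract concrete, below is a minimal NumPy/SciPy sketch (not from the paper) of a one-hidden-layer ReLU network whose weights are i.i.d. symmetric $α$-stable. The $(n \log n)^{1/α}$ rescaling of the output layer is assumed here purely to illustrate the extra logarithmic term required by ReLU, as opposed to the plain $n^{1/α}$ scaling that suffices for sub-linear activations; the exact scaling constants used in the paper may differ. The function names `stable_relu_nn` and `relu` are hypothetical.

```python
import numpy as np
from scipy.stats import levy_stable


def relu(x):
    """Elementwise ReLU activation."""
    return np.maximum(x, 0.0)


def stable_relu_nn(x, n, alpha=1.5, seed=None):
    """One-hidden-layer ReLU NN with i.i.d. symmetric alpha-stable weights.

    The output is rescaled by (n * log n)**(1/alpha); this logarithmic
    correction for the ReLU activation is assumed here for illustration,
    whereas sub-linear activations would be rescaled by n**(1/alpha).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # Hidden-layer weights and biases: symmetric alpha-stable (skewness beta = 0).
    W1 = levy_stable.rvs(alpha, 0.0, size=(n, d), random_state=rng)
    b1 = levy_stable.rvs(alpha, 0.0, size=n, random_state=rng)
    # Output-layer weights, also symmetric alpha-stable.
    W2 = levy_stable.rvs(alpha, 0.0, size=n, random_state=rng)
    h = relu(W1 @ x + b1)
    return (W2 @ h) / (n * np.log(n)) ** (1.0 / alpha)


if __name__ == "__main__":
    # Evaluating the rescaled network at a fixed input for growing widths
    # gives Monte Carlo draws that, informally, should approach the
    # infinitely wide alpha-stable limit described in the paper.
    x0 = np.ones(3)
    for n in (100, 1000, 10000):
        print(n, stable_relu_nn(x0, n=n, alpha=1.5, seed=0))
```

Because $α$-stable draws with $α < 2$ have heavy tails (infinite variance), individual outputs can be large even after rescaling; this is expected and is precisely why the limiting object is an $α$-stable process rather than a Gaussian one.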