Paper Title
Large-width asymptotics for ReLU neural networks with $α$-Stable initializations
Paper Authors
Paper Abstract
There is a recent and growing literature on large-width asymptotic properties of Gaussian neural networks (NNs), namely NNs whose weights are initialized as Gaussian distributions. Two popular problems are: i) the study of the large-width distributions of NNs, which characterizes the infinitely wide limit of a rescaled NN in terms of a Gaussian stochastic process; ii) the study of the large-width training dynamics of NNs, which characterizes the infinitely wide dynamics in terms of a deterministic kernel, referred to as the neural tangent kernel (NTK), and shows that, for a sufficiently large width, gradient descent achieves zero training error at a linear rate. In this paper, we consider these problems for $α$-Stable NNs, namely NNs whose weights are initialized as $α$-Stable distributions with $α \in (0,2]$. First, for $α$-Stable NNs with a ReLU activation function, we show that if the NN's width goes to infinity then a rescaled NN converges weakly to an $α$-Stable stochastic process. In contrast to the Gaussian setting, our result shows that the choice of the activation function affects the scaling of the NN, that is: to achieve the infinitely wide $α$-Stable process, the ReLU activation requires an additional logarithmic term in the scaling with respect to sub-linear activations. Then, we study the large-width training dynamics of $α$-Stable ReLU-NNs, characterizing the infinitely wide dynamics in terms of a random kernel, referred to as the $α$-Stable NTK, and showing that, for a sufficiently large width, gradient descent achieves zero training error at a linear rate. The randomness of the $α$-Stable NTK is a further difference with respect to the Gaussian setting, that is: within the $α$-Stable setting, the randomness of the NN at initialization does not vanish in the large-width regime of the training.
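To make the initialization and scaling discussed in the abstract concrete, below is a minimal NumPy/SciPy sketch (not from the paper) of a one-hidden-layer ReLU network whose weights are i.i.d. symmetric $α$-stable. The $(n \log n)^{1/α}$ rescaling of the output layer is assumed here purely to illustrate the extra logarithmic term required by ReLU, as opposed to the plain $n^{1/α}$ scaling that suffices for sub-linear activations; the exact scaling constants used in the paper may differ. The function names `stable_relu_nn` and `relu` are hypothetical.

```python
import numpy as np
from scipy.stats import levy_stable


def relu(x):
    """Elementwise ReLU activation."""
    return np.maximum(x, 0.0)


def stable_relu_nn(x, n, alpha=1.5, seed=None):
    """One-hidden-layer ReLU NN with i.i.d. symmetric alpha-stable weights.

    The output is rescaled by (n * log n)**(1/alpha); this logarithmic
    correction for the ReLU activation is assumed here for illustration,
    whereas sub-linear activations would be rescaled by n**(1/alpha).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # Hidden-layer weights and biases: symmetric alpha-stable (skewness beta = 0).
    W1 = levy_stable.rvs(alpha, 0.0, size=(n, d), random_state=rng)
    b1 = levy_stable.rvs(alpha, 0.0, size=n, random_state=rng)
    # Output-layer weights, also symmetric alpha-stable.
    W2 = levy_stable.rvs(alpha, 0.0, size=n, random_state=rng)
    h = relu(W1 @ x + b1)
    return (W2 @ h) / (n * np.log(n)) ** (1.0 / alpha)


if __name__ == "__main__":
    # Evaluating the rescaled network at a fixed input for growing widths
    # gives Monte Carlo draws that, informally, should approach the
    # infinitely wide alpha-stable limit described in the paper.
    x0 = np.ones(3)
    for n in (100, 1000, 10000):
        print(n, stable_relu_nn(x0, n=n, alpha=1.5, seed=0))
```

Because $α$-stable draws with $α < 2$ have heavy tails (infinite variance), individual outputs can be large even after rescaling; this is expected and is precisely why the limiting object is an $α$-stable process rather than a Gaussian one.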