Paper Title

The activity-weight duality in feed forward neural networks: The geometric determinants of generalization

Authors

Yu Feng, Yuhai Tu

Abstract

One of the fundamental problems in machine learning is generalization. In neural network models with a large number of weights (parameters), many solutions can be found that fit the training data equally well. The key question is which solution can describe testing data not in the training set. Here, we report the discovery of an exact duality (equivalence) between changes in the activities of a given layer of neurons and changes in the weights connecting it to the next layer of neurons, valid for any densely connected layer in any feed forward neural network. The activity-weight (A-W) duality allows us to map variations in inputs (data) to variations in the corresponding dual weights. Using this mapping, we show that the generalization loss can be decomposed into a sum of contributions from the different eigen-directions of the Hessian matrix of the loss function at the solution in weight space. The contribution from a given eigen-direction is the product of two geometric factors (determinants): the sharpness of the loss landscape along that direction and the standard deviation of the dual weights, which is found to scale with the weight norm of the solution. Our results provide a unified framework, which we use to reveal how different regularization schemes (weight decay, stochastic gradient descent with different batch sizes and learning rates, dropout), training data size, and labeling noise affect generalization performance by controlling either one or both of these two geometric determinants. These insights can be used to guide the development of algorithms for finding more generalizable solutions in overparametrized neural networks.
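To make the two key ideas in the abstract concrete, here is a minimal numerical sketch of (i) the A-W duality for a single dense layer and (ii) the resulting eigen-decomposition of the generalization loss. The elementwise rescaling used to construct the dual weights (W'_ij = W_ij * a'_j / a_j), the synthetic Hessian, and the synthetic dual-weight fluctuations are illustrative assumptions for this sketch, not the paper's exact construction or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Part 1: activity-weight (A-W) duality for one dense layer ---
# For a dense layer with pre-activations z = W @ a, a change in the input
# activities a -> a' can be absorbed into "dual" weights W' chosen so that
# W' @ a = W @ a'. One simple realization (an assumption of this sketch)
# rescales each column of W by the ratio of perturbed to original activities:
#   W'_ij = W_ij * (a'_j / a_j)
n_in, n_out = 5, 3
W = rng.normal(size=(n_out, n_in))
a = rng.uniform(0.5, 1.5, size=n_in)          # keep activities away from zero
a_prime = a + 0.1 * rng.normal(size=n_in)     # a perturbed (e.g. test) input

W_dual = W * (a_prime / a)                    # dual weights
assert np.allclose(W_dual @ a, W @ a_prime)   # identical pre-activations

# Note the induced weight change dW = W * (a' - a) / a is proportional to W
# itself, consistent with the dual-weight spread scaling with the weight norm
# of the solution.

# --- Part 2: eigen-decomposition of the generalization loss ---
# Near a solution w*, the loss at a dual weight vector w = w* + dw is
# approximately 0.5 * dw^T H dw, with H the Hessian of the loss at w*.
# Averaging over test points and diagonalizing H gives
#   gap ~ 0.5 * sum_i lambda_i * (sigma_i^2 + mu_i^2),
# a product of sharpness (lambda_i) and dual-weight spread (sigma_i)
# along each eigen-direction.
d = 10
A = rng.normal(size=(d, d))
H = A @ A.T / d                               # synthetic positive-definite Hessian
lam, V = np.linalg.eigh(H)                    # eigenvalues and eigenvectors

# Synthetic dual-weight fluctuations, standing in for those induced by
# mapping test inputs through the A-W duality.
dw = rng.normal(size=(1000, d)) * rng.uniform(0.1, 1.0, size=d)

gap_direct = 0.5 * np.mean(np.einsum("ki,ij,kj->k", dw, H, dw))
proj = dw @ V                                 # fluctuations along eigen-directions
gap_decomposed = 0.5 * np.sum(lam * (proj.var(axis=0) + proj.mean(axis=0) ** 2))
print(f"direct: {gap_direct:.6f}  decomposed: {gap_decomposed:.6f}")
```

In this quadratic regime the direct and decomposed estimates agree by construction; the substantive claim of the paper is that the dual-weight spreads obtained from real data via the duality, together with the Hessian sharpness along each eigen-direction, jointly determine the measured generalization gap.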
