Paper Title

Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization

Paper Authors

Chatterjee, Satrajit

Paper Abstract

An open question in the Deep Learning community is why neural networks trained with Gradient Descent generalize well on real datasets even though they are capable of fitting random data. We propose an approach to answering this question based on a hypothesis about the dynamics of gradient descent that we call Coherent Gradients: Gradients from similar examples are similar and so the overall gradient is stronger in certain directions where these reinforce each other. Thus changes to the network parameters during training are biased towards those that (locally) simultaneously benefit many examples when such similarity exists. We support this hypothesis with heuristic arguments and perturbative experiments and outline how this can explain several common empirical observations about Deep Learning. Furthermore, our analysis is not just descriptive, but prescriptive. It suggests a natural modification to gradient descent that can greatly reduce overfitting.
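The abstract states that the analysis suggests a natural modification to gradient descent that reduces overfitting; in the paper this takes the form of winsorized gradient descent, where per-example gradient components are clipped coordinate-wise before being averaged, so that update directions supported by only a few examples are damped. The sketch below illustrates that aggregation step only, under simplifying assumptions: the function name, the flattened per-example-gradient input format, and the NumPy formulation are ours, not the paper's implementation.

```python
import numpy as np

def winsorized_mean_gradient(per_example_grads, c=1):
    """Illustrative aggregation of per-example gradients in the spirit of
    Coherent Gradients: coordinates where many examples agree survive,
    while components driven by a few outlier examples are clipped.

    per_example_grads: array of shape (batch_size, num_params), one
        flattened gradient per example (hypothetical input format).
    c: number of extreme values to clip on each side, per coordinate.
    """
    g = np.asarray(per_example_grads)
    n = g.shape[0]
    assert n > 2 * c, "need more examples than clipped values per side"
    # Per coordinate, find the (c+1)-th smallest and (c+1)-th largest
    # values across the batch, then clip every example's component
    # to that range before averaging.
    sorted_g = np.sort(g, axis=0)
    lo = sorted_g[c]           # (c+1)-th smallest per coordinate
    hi = sorted_g[n - 1 - c]   # (c+1)-th largest per coordinate
    clipped = np.clip(g, lo, hi)
    return clipped.mean(axis=0)

# Toy usage: 8 examples, 5 parameters, random gradients.
rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 5))
update_direction = winsorized_mean_gradient(grads, c=1)
```

The resulting vector would replace the plain batch-mean gradient in an ordinary gradient-descent step; with c=0 it reduces to standard averaging.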
