Paper Title

A Study of Gradient Variance in Deep Learning

Paper Authors

Fartash Faghri, David Duvenaud, David J. Fleet, Jimmy Ba

Paper Abstract

The impact of gradient noise on training deep models is widely acknowledged but not well understood. In this context, we study the distribution of gradients during training. We introduce a method, Gradient Clustering, to minimize the variance of the average mini-batch gradient with stratified sampling. We prove that the variance of the average mini-batch gradient is minimized if the elements are sampled from a weighted clustering in the gradient space. We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training, and smaller learning rates coincide with higher variance. In addition, we introduce normalized gradient variance as a statistic that correlates better with the speed of convergence than gradient variance does.
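
As a rough illustration of the stratified-sampling idea in the abstract, the sketch below clusters toy per-example gradients with plain k-means, draws a fixed number of samples per cluster, and weights each cluster's sample mean by its share of the data; sampling within strata removes the between-cluster component of the estimator's variance. This is a minimal sketch of the general principle, not the paper's implementation: the helper names (`kmeans`, `stratified_mean_gradient`) and the closing "normalized gradient variance" formula are illustrative assumptions, and the paper's exact clustering weights and statistic may differ.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    # Plain k-means in gradient space: an illustrative stand-in for the
    # paper's weighted clustering; any clustering routine could be used.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each per-example gradient to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def stratified_mean_gradient(grads, labels, per_cluster=4, seed=0):
    # Stratified estimate of the full-batch mean gradient: sample a few
    # elements from each cluster and weight each cluster's sample mean
    # by the cluster's share of the dataset.
    rng = np.random.default_rng(seed)
    n = len(grads)
    estimate = np.zeros(grads.shape[1])
    for j in np.unique(labels):
        idx = np.flatnonzero(labels == j)
        take = rng.choice(idx, size=min(per_cluster, len(idx)), replace=False)
        estimate += (len(idx) / n) * grads[take].mean(axis=0)
    return estimate

# Toy per-example gradients with two unequal modes, mimicking cluster
# structure in gradient space (in practice these come from per-example
# backprop).
rng = np.random.default_rng(1)
grads = np.concatenate([
    rng.normal(+1.0, 0.1, size=(600, 10)),
    rng.normal(-1.0, 0.1, size=(400, 10)),
])
labels = kmeans(grads, k=2)

full_mean = grads.mean(axis=0)
est = stratified_mean_gradient(grads, labels)
print("error of stratified estimate:", np.linalg.norm(est - full_mean))

# One plausible "normalized gradient variance" (an assumption, not
# necessarily the paper's definition): per-example gradient variance
# divided by the squared norm of the mean gradient.
ngv = ((grads - full_mean) ** 2).sum(axis=1).mean() / (full_mean ** 2).sum()
print("normalized gradient variance (illustrative):", ngv)
```

On a toy mixture like this, the stratified estimate with 8 sampled gradients tracks the full-batch mean far more closely than 8 uniform samples would, because uniform sampling leaves the between-cluster spread in the estimator's variance while stratification does not.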
