Paper Title

Deep learning, stochastic gradient descent and diffusion maps

Paper Authors

Fjellström, Carmina, Nyström, Kaj

Paper Abstract

Stochastic gradient descent (SGD) is widely used in deep learning due to its computational efficiency, but a complete understanding of why SGD performs so well remains a major challenge. It has been observed empirically that most eigenvalues of the Hessian of the loss function on the loss landscape of over-parametrized deep neural networks are close to zero, while only a small number of eigenvalues are large. Zero eigenvalues indicate zero diffusion along the corresponding directions. This indicates that the process of minima selection mainly happens in the relatively low-dimensional subspace corresponding to the top eigenvalues of the Hessian. Although the parameter space is very high-dimensional, these findings seem to indicate that the SGD dynamics may mainly live on a low-dimensional manifold. In this paper, we pursue a truly data-driven approach to the problem of getting a potentially deeper understanding of the high-dimensional parameter surface, and in particular of the landscape traced out by SGD, by analyzing the data generated through SGD, or any other optimizer for that matter, in order to possibly discover (local) low-dimensional representations of the optimization landscape. As our vehicle for the exploration, we use diffusion maps introduced by R. Coifman and coauthors.
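To make the approach described in the abstract more concrete, the following is a minimal sketch of a standard diffusion map construction (Gaussian kernel, density normalization, Markov normalization, eigendecomposition) applied to flattened parameter snapshots collected along an SGD run. The function name `diffusion_map`, the median-distance bandwidth heuristic, and the synthetic snapshot data are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def diffusion_map(X, eps=None, k=2, t=1, alpha=1.0):
    """X: (n_snapshots, n_parameters) array of flattened SGD iterates."""
    # Pairwise squared distances via ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i.x_j.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)

    if eps is None:
        eps = np.median(sq_dists)  # bandwidth heuristic (an assumption, not from the paper)

    # Gaussian kernel.
    K = np.exp(-sq_dists / eps)

    # Alpha-normalization (alpha = 1 factors out the sampling density).
    q = K.sum(axis=1)
    K = K / np.outer(q, q) ** alpha

    # Row-normalization gives a Markov transition matrix P.
    P = K / K.sum(axis=1, keepdims=True)

    # Right eigenvectors of P; P is similar to a symmetric matrix,
    # so its eigenvalues are real up to round-off.
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)
    eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]

    # Drop the trivial constant eigenvector (eigenvalue 1) and scale by
    # lambda^t to obtain the diffusion coordinates at diffusion time t.
    return eigvecs[:, 1:k + 1] * eigvals[1:k + 1] ** t

# Hypothetical usage: 200 snapshots of a 1000-parameter model.
rng = np.random.default_rng(0)
snapshots = rng.normal(size=(200, 1000))
coords = diffusion_map(snapshots, k=2)
print(coords.shape)  # (200, 2)
```

In this sketch the leading non-trivial eigenvectors of the Markov matrix serve as low-dimensional coordinates; if the SGD iterates indeed concentrate near a low-dimensional manifold, only a few eigenvalues remain well separated from zero and the corresponding coordinates capture most of the trajectory's geometry.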
