Paper Title

Visualizing high-dimensional loss landscapes with Hessian directions

Authors

Lucas Böttcher, Gregory Wheeler

Abstract

Analyzing geometric properties of high-dimensional loss functions, such as local curvature and the existence of other optima around a certain point in loss space, can help provide a better understanding of the interplay between neural network structure, implementation attributes, and learning performance. In this work, we combine concepts from high-dimensional probability and differential geometry to study how curvature properties in lower-dimensional loss representations depend on those in the original loss space. We show that saddle points in the original space are rarely correctly identified as such in expected lower-dimensional representations if random projections are used. The principal curvature in the expected lower-dimensional representation is proportional to the mean curvature in the original loss space. Hence, the mean curvature in the original loss space determines if saddle points appear, on average, as either minima, maxima, or almost flat regions. We use the connection between expected curvature in random projections and mean curvature in the original space (i.e., the normalized Hessian trace) to compute Hutchinson-type trace estimates without calculating Hessian-vector products as in the original Hutchinson method. Because random projections are not suitable to correctly identify saddle information, we propose to study projections along dominant Hessian directions that are associated with the largest and smallest principal curvatures. We connect our findings to the ongoing debate on loss landscape flatness and generalizability. Finally, for different common image classifiers and a function approximator, we show and compare random and Hessian projections of loss landscapes with up to about $7\times 10^6$ parameters.
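The abstract's key computational idea can be made concrete: the classical Hutchinson estimator approximates the Hessian trace via Hessian-vector products, while the curvature-based variant described above recovers the same quantity from directional second derivatives of the loss alone. The sketch below is a minimal illustration of both, not the authors' implementation; the function names and the finite-difference step size are our own assumptions.

```python
import numpy as np

def hutchinson_trace(hess_vec_product, dim, num_samples=100, rng=None):
    """Classical Hutchinson estimator: tr(H) ~ E[v^T H v] with
    Rademacher-distributed probe vectors v. Requires Hessian-vector products."""
    rng = np.random.default_rng(rng)
    estimate = 0.0
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=dim)
        estimate += v @ hess_vec_product(v)
    return estimate / num_samples

def trace_from_random_curvature(loss, theta, num_samples=100, step=1e-3, rng=None):
    """Hutchinson-type estimate without Hessian-vector products (hypothetical
    sketch): the mean second-order central difference of the loss along random
    unit directions approximates the mean curvature tr(H)/dim; rescaling by
    dim yields an estimate of tr(H)."""
    rng = np.random.default_rng(rng)
    dim = theta.size
    mean_curvature = 0.0
    for _ in range(num_samples):
        v = rng.standard_normal(dim)
        v /= np.linalg.norm(v)  # uniform direction on the unit sphere
        # directional curvature v^T H v via a central finite difference
        mean_curvature += (
            loss(theta + step * v) - 2.0 * loss(theta) + loss(theta - step * v)
        ) / step**2
    return dim * mean_curvature / num_samples
```

For a quadratic loss both estimators converge to the exact trace; the second variant only evaluates the loss itself, which is the point of the connection between expected projected curvature and the normalized Hessian trace.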
