Paper Title
Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States
Paper Authors
Paper Abstract
Stochastic differential equations (SDEs) have recently been shown to characterize well the dynamics of training machine learning models with SGD. When the generalization error of the SDE approximation closely aligns with that of SGD in expectation, it provides two opportunities to better understand the generalization behaviour of SGD through its SDE approximation. First, viewing SGD as full-batch gradient descent with Gaussian gradient noise allows us to obtain a trajectory-based generalization bound using the information-theoretic bound of Xu and Raginsky [2017]. Second, under mild conditions, we estimate the steady-state weight distribution of the SDE and use the information-theoretic bounds of Xu and Raginsky [2017] and Negrea et al. [2019] to establish terminal-state-based generalization bounds. Our proposed bounds have some advantages: notably, the trajectory-based bound outperforms the results in Wang and Mao [2022], and the terminal-state-based bound exhibits a fast decay rate comparable to stability-based bounds.
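For context, the information-theoretic bound of Xu and Raginsky [2017] invoked in the abstract states that if the loss is $\sigma$-sub-Gaussian under the data distribution, then the expected generalization gap of an algorithm mapping a sample $S$ of $n$ data points to weights $W$ satisfies

$$\bigl|\mathbb{E}[\mathrm{gen}(S, W)]\bigr| \le \sqrt{\frac{2\sigma^2}{n}\, I(S; W)},$$

where $I(S; W)$ is the mutual information between the sample and the learned weights. The SDE approximation of SGD referred to above is, in the form commonly used in this line of work (the notation here is standard and assumed; the paper's exact drift and diffusion terms may differ),

$$dW_t = -\nabla L(W_t)\, dt + \sqrt{\eta}\, \Sigma(W_t)^{1/2}\, dB_t,$$

with learning rate $\eta$, empirical loss $L$, gradient-noise covariance $\Sigma$, and standard Brownian motion $B_t$; the trajectory-based bound treats the diffusion term as Gaussian gradient noise added to full-batch gradient descent.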