Paper Title

Provable Adaptivity of Adam under Non-uniform Smoothness

Paper Authors

Bohan Wang, Yushun Zhang, Huishuai Zhang, Qi Meng, Ruoyu Sun, Zhi-Ming Ma, Tie-Yan Liu, Zhi-Quan Luo, Wei Chen

Paper Abstract

Adam is widely adopted in practical applications due to its fast convergence. However, its theoretical analysis is still far from satisfactory. Existing convergence analyses for Adam rely on the bounded smoothness assumption, referred to as the \emph{L-smooth condition}. Unfortunately, this assumption does not hold for many deep learning tasks. Moreover, we believe that this assumption obscures the true benefit of Adam, as the algorithm can adapt its update magnitude according to local smoothness. This important feature of Adam becomes irrelevant when assuming globally bounded smoothness. This paper studies the convergence of randomly reshuffled Adam (RR Adam) with diminishing learning rate, which is the major version of Adam adopted in deep learning tasks. We present the first convergence analysis of RR Adam without the bounded smoothness assumption. We demonstrate that RR Adam can maintain its convergence properties when smoothness is linearly bounded by the gradient norm, referred to as the \emph{$(L_0, L_1)$-smooth condition}. We further compare Adam to SGD when both methods use diminishing learning rate. We refine the existing lower bound of SGD and show that SGD can be slower than Adam. To our knowledge, this is the first time that Adam and SGD are rigorously compared in the same setting and the advantage of Adam is revealed.
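
For reference, the \emph{$(L_0, L_1)$-smooth condition} mentioned above is commonly stated (for twice-differentiable $f$) as a relaxation of the standard $L$-smooth condition; the sketch below follows the usual formulation in the generalized-smoothness literature and may differ in details (e.g., coordinate-wise variants) from the exact assumption used in the paper.

$L$-smooth condition: $\|\nabla^2 f(x)\| \le L$ for all $x$.

$(L_0, L_1)$-smooth condition: $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$ for all $x$.

Under the second condition the local smoothness can grow with the gradient norm, which is exactly the regime where an update magnitude that adapts to local smoothness, as in Adam, is expected to help.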
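
As an illustration of the algorithm being analyzed (not the authors' exact formulation), a minimal sketch of randomly reshuffled Adam with a diminishing learning rate is given below; the $1/\sqrt{k}$ step-size schedule, the hyperparameter values, and the omission of bias correction are illustrative assumptions.

import numpy as np

def rr_adam(grad_fn, x, n_samples, n_epochs,
            eta0=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Sketch of randomly reshuffled (without-replacement) Adam.

    grad_fn(x, i) should return the gradient of the i-th component
    function at x. The eta0 / sqrt(k) schedule is one illustrative
    choice of diminishing learning rate.
    """
    m = np.zeros_like(x)  # first-moment (momentum) estimate
    v = np.zeros_like(x)  # second-moment estimate
    for k in range(1, n_epochs + 1):
        eta = eta0 / np.sqrt(k)                  # diminishing learning rate
        perm = np.random.permutation(n_samples)  # reshuffle once per epoch
        for i in perm:                           # one full pass, no replacement
            g = grad_fn(x, i)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g ** 2
            x = x - eta * m / (np.sqrt(v) + eps)
    return x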
