Paper Title
Amortized Proximal Optimization
Paper Authors
Paper Abstract
We propose a framework for online meta-optimization of parameters that govern optimization, called Amortized Proximal Optimization (APO). We first interpret various existing neural network optimizers as approximate stochastic proximal point methods which trade off the current-batch loss with proximity terms in both function space and weight space. The idea behind APO is to amortize the minimization of the proximal point objective by meta-learning the parameters of an update rule. We show how APO can be used to adapt a learning rate or a structured preconditioning matrix. Under appropriate assumptions, APO can recover existing optimizers such as natural gradient descent and KFAC. It enjoys low computational overhead and avoids expensive and numerically sensitive operations required by some second-order optimizers, such as matrix inverses. We empirically test APO for online adaptation of learning rates and structured preconditioning matrices for regression, image reconstruction, image classification, and natural language translation tasks. Empirically, the learning rate schedules found by APO generally outperform optimal fixed learning rates and are competitive with manually tuned decay schedules. Using APO to adapt a structured preconditioning matrix generally results in optimization performance competitive with second-order methods. Moreover, the absence of matrix inversion provides numerical stability, making it effective for low precision training.
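The abstract only sketches the idea at a high level. As a rough illustration (not the authors' implementation), the following is a minimal sketch of APO-style online learning-rate adaptation on a toy linear-regression problem: the meta-step differentiates a one-step proximal objective (current-batch loss plus a weight-space proximity term) with respect to a log learning rate. The names (`log_lr`, `lambda_w`, `batch_loss`) and the choice of a weight-space-only proximity term are assumptions for brevity; the paper also uses a function-space proximity term and extends the same idea to structured preconditioning matrices.

```python
# Minimal sketch of APO-style online learning-rate adaptation (assumed setup,
# not the paper's code). The meta-parameter is a log learning rate, adapted by
# minimizing a one-step proximal objective on the current batch.
import torch

torch.manual_seed(0)

# Toy linear-regression data.
X = torch.randn(256, 10)
true_w = torch.randn(10, 1)
y = X @ true_w + 0.1 * torch.randn(256, 1)

w = torch.zeros(10, 1, requires_grad=True)        # model parameters
log_lr = torch.tensor(-3.0, requires_grad=True)   # meta-learned log learning rate
meta_opt = torch.optim.Adam([log_lr], lr=1e-2)    # optimizer for the meta-parameter
lambda_w = 1e-2                                   # assumed weight-space proximity weight


def batch_loss(weights, xb, yb):
    return ((xb @ weights - yb) ** 2).mean()


for step in range(200):
    idx = torch.randint(0, 256, (32,))
    xb, yb = X[idx], y[idx]

    # Meta step: build a candidate SGD update as a differentiable function of
    # log_lr, then descend the proximal objective
    #   L(w_new) + lambda_w * ||w_new - w||^2
    # with respect to log_lr only.
    grad_w, = torch.autograd.grad(batch_loss(w, xb, yb), w, create_graph=True)
    w_new = w - torch.exp(log_lr) * grad_w
    proximal_obj = (batch_loss(w_new, xb, yb)
                    + lambda_w * ((w_new - w.detach()) ** 2).sum())
    g_log_lr, = torch.autograd.grad(proximal_obj, log_lr)
    meta_opt.zero_grad()
    log_lr.grad = g_log_lr
    meta_opt.step()

    # Base step: apply a plain SGD update with the adapted learning rate.
    grad_w, = torch.autograd.grad(batch_loss(w, xb, yb), w)
    with torch.no_grad():
        w -= torch.exp(log_lr) * grad_w

    if step % 50 == 0:
        print(f"step {step:3d}  loss {batch_loss(w, X, y).item():.4f}  "
              f"lr {torch.exp(log_lr).item():.4f}")
```

In this sketch the proximity term keeps the meta-step from choosing learning rates that move the weights too far on a single noisy batch; adapting a structured preconditioner would amount to replacing the scalar `exp(log_lr)` with a parameterized matrix applied to the gradient.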