Paper Title
Scalable Second Order Optimization for Deep Learning
Paper Authors
Paper Abstract
Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, which involve second derivatives and/or second-order statistics of the data, are far less prevalent despite their strong theoretical properties, due to prohibitive computation, memory, and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad) that, along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to the state of the art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.
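For concreteness, the following is a minimal NumPy sketch of the plain full-matrix Adagrad preconditioner that the abstract's method builds on: the gradient is rescaled by the inverse square root of the accumulated matrix of gradient outer products. The function name, step size, and toy quadratic are illustrative assumptions; the paper's scalable variant concerns how this preconditioner is approximated and computed efficiently, which the abstract does not detail.

import numpy as np

def full_matrix_adagrad_step(x, grad, G, lr=0.1, eps=1e-6):
    """One full-matrix Adagrad step on a flat parameter vector x.

    G accumulates the sum of gradient outer products; the gradient is
    preconditioned by G^{-1/2} before being applied.
    """
    G = G + np.outer(grad, grad)                  # second-order gradient statistics
    # Inverse square root via eigendecomposition (G + eps*I is symmetric positive definite).
    w, V = np.linalg.eigh(G + eps * np.eye(G.shape[0]))
    G_inv_sqrt = (V * w ** -0.5) @ V.T            # V diag(w^{-1/2}) V^T
    return x - lr * G_inv_sqrt @ grad, G

# Toy usage (illustrative only): minimize a poorly conditioned quadratic 0.5 * x^T A x.
A = np.diag([100.0, 1.0])
x = np.array([1.0, 1.0])
G = np.zeros((2, 2))
for _ in range(100):
    x, G = full_matrix_adagrad_step(x, A @ x, G)
print(x)  # x moves toward the minimizer at the origin

The obvious obstacle, and the motivation stated in the abstract, is that for a model with n parameters this preconditioner is an n-by-n matrix, so maintaining and inverting it directly is infeasible at deep-learning scale.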