Paper Title
Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent
Paper Authors
Paper Abstract
Machine learning has made tremendous progress in recent years, with models matching or even surpassing humans on a series of specialized tasks. A key element behind this progress has been the ability to train machine learning models in large-scale distributed shared-memory and message-passing environments. Many of these models are trained using variants of stochastic gradient descent (SGD)-based optimization. In this paper, we introduce a general consistency condition covering communication-reduced and asynchronous distributed SGD implementations. Our framework, called elastic consistency, enables us to derive convergence bounds for a variety of distributed SGD methods used in practice to train large-scale machine learning models. The proposed framework de-clutters implementation-specific convergence analysis and provides an abstraction for deriving convergence bounds. We use the framework to analyze a sparsification scheme for distributed SGD methods in an asynchronous setting, for both convex and non-convex objectives. We implement the distributed SGD variant to train deep CNN models in an asynchronous shared-memory setting. Empirical results show that error feedback does not necessarily help improve the convergence of sparsified asynchronous distributed SGD, corroborating an insight suggested by our convergence analysis.
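To make the abstract's ingredients concrete, the following is a minimal sketch, not the paper's implementation, of a top-k sparsified SGD step with optional error feedback of the kind discussed above. All names here (`topk_sparsify`, `sgd_step_sparsified`, `k`, `lr`) are illustrative assumptions, and the single-worker toy loop stands in for the asynchronous distributed setting analyzed in the paper.

```python
# Sketch of a top-k sparsified SGD update with optional error feedback.
# Assumption: a single worker and a toy quadratic objective, used only to
# illustrate the mechanism; the paper's setting is asynchronous and distributed.
import numpy as np

def topk_sparsify(g, k):
    """Keep the k largest-magnitude entries of g; zero out the rest."""
    sparse = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    sparse[idx] = g[idx]
    return sparse

def sgd_step_sparsified(w, grad, residual, k, lr, error_feedback=True):
    """One update: optionally add the accumulated residual (error feedback),
    sparsify, apply the sparse update, and carry over the dropped mass."""
    corrected = grad + residual if error_feedback else grad
    sparse = topk_sparsify(corrected, k)
    new_residual = corrected - sparse if error_feedback else np.zeros_like(grad)
    return w - lr * sparse, new_residual

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is w.
rng = np.random.default_rng(0)
w = rng.normal(size=100)
residual = np.zeros_like(w)
for _ in range(200):
    grad = w + 0.01 * rng.normal(size=w.shape)  # noisy stochastic gradient
    w, residual = sgd_step_sparsified(w, grad, residual, k=10, lr=0.1)
print("final ||w|| =", np.linalg.norm(w))
```

Setting `error_feedback=False` in this sketch corresponds to plain sparsified SGD; comparing the two variants mirrors, in a very reduced form, the comparison whose empirical outcome the abstract reports.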