Paper Title

Locally Asynchronous Stochastic Gradient Descent for Decentralised Deep Learning

Paper Authors

Tomer Avidor, Nadav Tal Israel

Abstract

Distributed training algorithms for deep neural networks show impressive convergence speedup properties on very large problems. However, they inherently suffer from communication-related slowdowns, and the communication topology becomes a crucial design choice. Common approaches supported by most machine learning frameworks are: 1) synchronous decentralised algorithms relying on a peer-to-peer All Reduce topology that is sensitive to stragglers and communication delays, and 2) asynchronous centralised algorithms with a server-based topology that is prone to communication bottlenecks. Researchers have also suggested asynchronous decentralised algorithms designed to avoid the bottleneck and speed up training; however, these commonly use inexact sparse averaging that may lead to a degradation in accuracy. In this paper, we propose Local Asynchronous SGD (LASGD), an asynchronous decentralised algorithm that relies on All Reduce for model synchronisation. We empirically validate LASGD's performance on image classification tasks on the ImageNet dataset. Our experiments demonstrate that LASGD accelerates training compared to SGD and state-of-the-art gossip-based approaches.
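To make the idea in the abstract concrete, below is a minimal sketch (not the authors' reference implementation) of the general pattern LASGD builds on: each worker runs several local SGD steps with no communication on the critical path and then performs exact model averaging over an All Reduce topology. The paper's contribution is to carry out this synchronisation asynchronously, which is not reproduced here. `LOCAL_STEPS`, `average_model`, and the training objects passed to `train` are hypothetical names chosen for illustration, and `torch.distributed` is assumed to be initialised beforehand.

```python
import torch
import torch.distributed as dist

LOCAL_STEPS = 8  # hypothetical number of local SGD steps between synchronisations


@torch.no_grad()
def average_model(model: torch.nn.Module) -> None:
    """Replace each parameter with its exact average across all workers (All Reduce)."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p, op=dist.ReduceOp.SUM)
        p.div_(world_size)


def train(model, optimizer, loss_fn, data_loader, device):
    model.train()
    for step, (x, y) in enumerate(data_loader):
        x, y = x.to(device), y.to(device)

        # Plain local SGD step: no communication here.
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

        # Periodic exact model averaging; LASGD decouples this step from the
        # local computation so workers do not block on stragglers.
        if (step + 1) % LOCAL_STEPS == 0:
            average_model(model)
```

In this sketch the averaging call is blocking, which is what makes standard local SGD sensitive to stragglers; the asynchronous variant described in the abstract overlaps this synchronisation with ongoing local computation instead.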
