Paper Title
Adaptive Top-K in SGD for Communication-Efficient Distributed Learning
Paper Authors
Paper Abstract
Distributed stochastic gradient descent (SGD) with gradient compression has become a popular communication-efficient solution for accelerating distributed learning. One commonly used method for gradient compression is Top-K sparsification, which sparsifies the gradients by a fixed degree during model training. However, there has been a lack of an adaptive approach to adjust the sparsification degree to maximize the potential of the model's performance or training speed. This paper proposes a novel adaptive Top-K in SGD framework that enables an adaptive degree of sparsification for each gradient descent step to optimize the convergence performance by balancing the trade-off between communication cost and convergence error. Firstly, an upper bound of convergence error is derived for the adaptive sparsification scheme and the loss function. Secondly, an algorithm is designed to minimize the convergence error under the communication cost constraints. Finally, numerical results on the MNIST and CIFAR-10 datasets demonstrate that the proposed adaptive Top-K algorithm in SGD achieves a significantly better convergence rate compared to state-of-the-art methods, even after considering error compensation.
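To make the compression step concrete, below is a minimal sketch of plain Top-K gradient sparsification: only the k largest-magnitude entries of the gradient are kept and the rest are zeroed, so a worker only needs to communicate k values and their indices. This is an illustrative NumPy implementation, not the authors' code; the function name `top_k_sparsify` and the fixed k are assumptions, whereas the paper's adaptive scheme would choose k separately for each gradient descent step.

```python
# Sketch of Top-K gradient sparsification (illustrative, not the paper's implementation).
import numpy as np

def top_k_sparsify(grad: np.ndarray, k: int) -> np.ndarray:
    """Return a copy of `grad` with all but the k largest-magnitude entries zeroed."""
    flat = grad.ravel()
    if k >= flat.size:
        return grad.copy()
    # Indices of the k entries with the largest absolute value.
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(grad.shape)

# Example: keep the top 2 of 5 gradient entries.
g = np.array([0.1, -2.0, 0.5, 3.0, -0.05])
print(top_k_sparsify(g, k=2))  # -> [ 0. -2.  0.  3.  0.]
```

In an adaptive variant, k would be set per step (e.g., larger k when the convergence-error bound is sensitive to compression, smaller k when communication is the bottleneck), which is the trade-off the abstract describes.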