Paper Title
Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization
Paper Authors
Paper Abstract
Due to the high communication cost in distributed and federated learning problems, methods relying on compression of communicated messages are becoming increasingly popular. While in other contexts the best performing gradient-type methods invariably rely on some form of acceleration/momentum to reduce the number of iterations, there are no methods which combine the benefits of both gradient compression and acceleration. In this paper, we remedy this situation and propose the first accelerated compressed gradient descent (ACGD) methods. In the single machine regime, we prove that ACGD enjoys the rate $O\Big((1+\omega)\sqrt{\frac{L}{\mu}}\log \frac{1}{\epsilon}\Big)$ for $\mu$-strongly convex problems and $O\Big((1+\omega)\sqrt{\frac{L}{\epsilon}}\Big)$ for convex problems, respectively, where $\omega$ is the compression parameter. Our results improve upon the existing non-accelerated rates $O\Big((1+\omega)\frac{L}{\mu}\log \frac{1}{\epsilon}\Big)$ and $O\Big((1+\omega)\frac{L}{\epsilon}\Big)$, respectively, and recover the optimal rates of accelerated gradient descent as a special case when no compression ($\omega=0$) is applied. We further propose a distributed variant of ACGD (called ADIANA) and prove the convergence rate $\widetilde{O}\Big(\omega+\sqrt{\frac{L}{\mu}}+\sqrt{\big(\frac{\omega}{n}+\sqrt{\frac{\omega}{n}}\big)\frac{\omega L}{\mu}}\Big)$, where $n$ is the number of devices/workers and $\widetilde{O}$ hides the logarithmic factor $\log \frac{1}{\epsilon}$. This improves upon the previous best result $\widetilde{O}\Big(\omega+ \frac{L}{\mu}+\frac{\omega L}{n\mu} \Big)$ achieved by the DIANA method of Mishchenko et al. (2019). Finally, we conduct several experiments on real-world datasets which corroborate our theoretical results and confirm the practical superiority of our accelerated methods.
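For context, the compression parameter $\omega$ in the rates above is typically defined through the standard unbiased compression model used in this line of work (e.g., by DIANA-type methods); the display below is a minimal sketch of that assumption for an operator $\mathcal{C}$, stated here for orientation rather than quoted from the paper itself.

$$
\mathbb{E}\left[\mathcal{C}(x)\right] = x,
\qquad
\mathbb{E}\left\|\mathcal{C}(x) - x\right\|^2 \le \omega \left\|x\right\|^2
\qquad \text{for all } x \in \mathbb{R}^d.
$$

Under this convention, $\omega = 0$ corresponds to no compression (consistent with the abstract), and, as one illustration, random-$k$ sparsification of a $d$-dimensional vector rescaled by $d/k$ satisfies the assumption with $\omega = d/k - 1$.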