Paper Title

Anytime Minibatch with Delayed Gradients

Authors

Haider Al-Lawati, Stark C. Draper

Abstract

Distributed optimization is widely deployed in practice to solve a broad range of problems. In a typical asynchronous scheme, workers calculate gradients with respect to out-of-date optimization parameters while the master uses stale (i.e., delayed) gradients to update the parameters. While using stale gradients can slow convergence, asynchronous methods speed up the overall optimization with respect to wall clock time by allowing more frequent updates and reducing idling times. In this paper, we present a variable per-epoch minibatch scheme called Anytime Minibatch with Delayed Gradients (AMB-DG). In AMB-DG, workers compute gradients in epochs of fixed duration while the master uses stale gradients to update the optimization parameters. We analyze AMB-DG in terms of its regret bound and convergence rate. We prove that for convex smooth objective functions, AMB-DG achieves the optimal regret bound and convergence rate. We compare the performance of AMB-DG with that of Anytime Minibatch (AMB), which is similar to AMB-DG but does not use stale gradients. In AMB, workers stay idle after each gradient transmission to the master until they receive the updated parameters, whereas in AMB-DG workers never idle. We also extend AMB-DG to the fully distributed setting. We compare AMB-DG with AMB when the communication delay is long and observe that AMB-DG converges faster than AMB in wall clock time. We also compare the performance of AMB-DG with a state-of-the-art fixed minibatch approach that uses delayed gradients. We run our experiments on a real distributed system and observe that AMB-DG converges more than two times faster.
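The core mechanism the abstract describes, a master applying gradients that were computed against stale parameters while workers accumulate variable-size minibatches over fixed-time epochs, can be illustrated with a minimal single-process simulation. This is a hypothetical sketch, not the authors' implementation: the objective (a 1-D least-squares fit), the staleness model (a fixed pipeline of `delay` in-flight gradients), and all parameter values are illustrative assumptions.

```python
# Hedged sketch of delayed-gradient minibatch SGD in the spirit of AMB-DG:
# each step, workers contribute a variable-size minibatch gradient (batch size
# stands in for "whatever finished within the fixed-time epoch"), and the
# master updates using the gradient that entered the pipeline `delay` steps
# ago, i.e., a gradient computed at stale parameters.
from collections import deque
import random

def grad(w, batch):
    # Gradient of the mean squared error f(w) = mean((w*x - y)^2) over (x, y) pairs.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def amb_dg_sim(steps=200, delay=3, lr=0.05):
    rng = random.Random(0)
    w = 0.0                       # optimization parameter held by the master
    pending = deque()             # gradients "in flight" (staleness = delay)
    for _ in range(steps):
        # Variable per-epoch minibatch: size varies step to step, mimicking
        # how much work the workers complete in one fixed wall-clock epoch.
        batch = []
        for _ in range(rng.randint(4, 16)):
            x = rng.uniform(-1.0, 1.0)
            batch.append((x, 3.0 * x))   # noiseless targets with true slope 3.0
        pending.append(grad(w, batch))   # gradient taken at the CURRENT (soon stale) w
        if len(pending) > delay:
            w -= lr * pending.popleft()  # master applies a delay-step-old gradient
    return w

print(amb_dg_sim())  # approaches the true slope 3.0 despite the stale updates
```

Despite every update using a gradient that is `delay` steps old, the iterate still converges for this smooth convex objective, which is the qualitative behavior the paper's regret and convergence analysis makes precise.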
