Paper Title
Refined Convergence and Topology Learning for Decentralized SGD with Heterogeneous Data
Paper Authors
Paper Abstract
One of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of the popular Decentralized Stochastic Gradient Descent algorithm (D-SGD) under data heterogeneity. We exhibit the key role played by a new quantity, called neighborhood heterogeneity, in the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts. We then argue that neighborhood heterogeneity provides a natural criterion to learn data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated over a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication costs of D-SGD under data heterogeneity.
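For readers unfamiliar with D-SGD, here is a minimal sketch of one iteration. This is not the paper's code: the gossip-then-gradient variant, the step size, and all variable names are illustrative assumptions, with the mixing matrix W assumed doubly stochastic as is standard in D-SGD analyses.

```python
import numpy as np

def d_sgd_step(params, stoch_grads, W, lr=0.1):
    # One D-SGD iteration (gossip-averaging variant, assumed for illustration):
    # each agent averages its neighbors' models according to the mixing
    # matrix W, then takes a local stochastic gradient step.
    #   params:      (n_agents, dim) array; row i holds agent i's current model
    #   stoch_grads: (n_agents, dim) array; row i holds a stochastic gradient
    #                of agent i's local objective at params[i]
    #   W:           (n_agents, n_agents) doubly stochastic mixing matrix;
    #                W[i, j] > 0 only if agents i and j communicate
    return W @ params - lr * stoch_grads

# Toy usage: 4 agents on a ring, uniform weights over self + 2 neighbors.
n_agents, dim = 4, 3
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    for j in (i - 1, i, i + 1):
        W[i, j % n_agents] = 1.0 / 3.0
params = np.random.randn(n_agents, dim)
grads = np.random.randn(n_agents, dim)  # stand-in for per-agent gradients
params = d_sgd_step(params, grads, W)
```

The topology learning described in the abstract would enter precisely through W: choosing data-dependent mixing weights is what lowers neighborhood heterogeneity and thereby improves the convergence rate.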
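The paper solves its topology-learning problem with a Frank-Wolfe algorithm. The abstract does not spell out the objective or feasible set, so the sketch below shows the generic method on a stand-in problem (minimizing a smooth function over the probability simplex, whose linear minimization oracle returns a vertex); it is not the paper's formulation.

```python
import numpy as np

def frank_wolfe(grad_f, dim, n_iters=200):
    # Generic Frank-Wolfe (conditional gradient) over the probability simplex.
    # The simplex is a stand-in feasible set; the paper's actual constraint
    # set (sparse mixing matrices) and objective (neighborhood heterogeneity
    # under label skew) are not given in the abstract.
    x = np.full(dim, 1.0 / dim)           # start at the simplex barycenter
    for t in range(n_iters):
        g = grad_f(x)
        s = np.zeros(dim)
        s[np.argmin(g)] = 1.0             # LMO: best vertex of the simplex
        gamma = 2.0 / (t + 2.0)           # standard step-size schedule
        x = (1.0 - gamma) * x + gamma * s
    return x

# Example: Euclidean projection of b onto the simplex, i.e. min ||x - b||^2.
b = np.array([0.7, 0.1, -0.2, 0.4])
x_star = frank_wolfe(lambda x: 2.0 * (x - b), dim=b.size)
```

Frank-Wolfe is a natural fit here because each iterate is a convex combination of few extreme points and is therefore sparse, matching the abstract's goal of designing a sparse communication topology.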