Paper Title


Taming Fat-Tailed (``Heavier-Tailed'' with Potentially Infinite Variance) Noise in Federated Learning

Paper Authors

Haibo Yang, Peiwen Qiu, Jia Liu

Paper Abstract


A key assumption in most existing works on FL algorithms' convergence analysis is that the noise in stochastic first-order information has a finite variance. Although this assumption covers all light-tailed (i.e., sub-exponential) and some heavy-tailed noise distributions (e.g., log-normal, Weibull, and some Pareto distributions), it fails for many fat-tailed noise distributions (i.e., ``heavier-tailed'' with potentially infinite variance) that have been empirically observed in the FL literature. To date, it remains unclear whether one can design convergent algorithms for FL systems that experience fat-tailed noise. This motivates us to fill this gap in this paper by proposing an algorithmic framework called FAT-Clipping (federated averaging with two-sided learning rates and clipping), which contains two variants: FAT-Clipping per-round (FAT-Clipping-PR) and FAT-Clipping per-iteration (FAT-Clipping-PI). Specifically, for the largest $\alpha \in (1,2]$ such that the fat-tailed noise in FL still has a bounded $\alpha$-moment, we show that the two variants achieve $\mathcal{O}((mT)^{\frac{2-\alpha}{\alpha}})$ and $\mathcal{O}((mT)^{\frac{1-\alpha}{3\alpha-2}})$ convergence rates in the strongly-convex and general non-convex settings, respectively, where $m$ and $T$ are the numbers of clients and communication rounds. Moreover, at the expense of more clipping operations compared to FAT-Clipping-PR, FAT-Clipping-PI further enjoys a linear speedup effect with respect to the number of local updates at each client and matches the lower bound (i.e., is order-optimal). Collectively, our results advance the understanding of designing efficient algorithms for FL systems that exhibit fat-tailed first-order oracle information.
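To make the two clipping variants concrete, below is a minimal NumPy sketch of one communication round of the FAT-Clipping template, written from the abstract's description rather than the authors' implementation. The clipping threshold lam, the local and global learning rates eta_l and eta, the number of local steps K, the grad_oracle function, and the toy Student-t gradient noise (degrees of freedom 1.5, giving finite mean but infinite variance and a bounded $\alpha$-moment for $\alpha < 1.5$) are all hypothetical choices for illustration.

```python
import numpy as np

def clip(v, lam):
    # Scale v so that its Euclidean norm is at most lam (identity if already smaller).
    norm = np.linalg.norm(v)
    return v if norm <= lam else v * (lam / norm)

def fat_clipping_round(x, grad_oracle, m, K, eta_l, eta, lam, per_iteration):
    # One communication round of the FAT-Clipping template (illustrative sketch).
    #   per_iteration=True  -> FAT-Clipping-PI: clip every local stochastic gradient.
    #   per_iteration=False -> FAT-Clipping-PR: clip each client's accumulated update once.
    deltas = []
    for i in range(m):
        y = x.copy()
        for _ in range(K):
            g = grad_oracle(i, y)              # possibly fat-tailed stochastic gradient
            if per_iteration:
                g = clip(g, lam)               # per-iteration clipping (PI)
            y = y - eta_l * g                  # local SGD step with local rate eta_l
        delta = y - x                          # accumulated local model change
        if not per_iteration:
            delta = clip(delta, lam)           # per-round clipping (PR)
        deltas.append(delta)
    # Server aggregates the (clipped) updates with the global rate eta.
    return x + eta * np.mean(deltas, axis=0)

# Toy usage: client i minimizes 0.5 * ||y - i||^2, with Student-t gradient noise
# (df = 1.5, so the noise variance is infinite).
rng = np.random.default_rng(0)
def grad_oracle(i, y):
    return (y - i) + rng.standard_t(1.5, size=y.shape)

x = np.zeros(5)
for _ in range(200):
    x = fat_clipping_round(x, grad_oracle, m=4, K=5,
                           eta_l=0.05, eta=1.0, lam=1.0, per_iteration=True)
```

Switching per_iteration to False turns the same call into the per-round variant; appropriate thresholds and learning rates follow from the paper's analysis and are not reproduced here.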
