Paper Title

GBA: A Tuning-free Approach to Switch between Synchronous and Asynchronous Training for Recommendation Model

Authors

Wenbo Su, Yuanxing Zhang, Yufeng Cai, Kaixu Ren, Pengjie Wang, Huimin Yi, Yue Song, Jing Chen, Hongbo Deng, Jian Xu, Lin Qu, Bo Zheng

Abstract

High-concurrency asynchronous training upon parameter server (PS) architecture and high-performance synchronous training upon all-reduce (AR) architecture are the most commonly deployed distributed training modes for recommendation models. Although synchronous AR training is designed to have higher training efficiency, asynchronous PS training would be a better choice for training speed when there are stragglers (slow workers) in the shared cluster, especially under limited computing resources. An ideal way to take full advantage of these two training modes is to switch between them upon the cluster status. However, switching training modes often requires tuning hyper-parameters, which is extremely time- and resource-consuming. We find two obstacles to a tuning-free approach: the different distribution of the gradient values and the stale gradients from the stragglers. This paper proposes Global Batch gradients Aggregation (GBA) over PS, which aggregates and applies gradients with the same global batch size as the synchronous training. A token-control process is implemented to assemble the gradients and decay the gradients with severe staleness. We provide a convergence analysis to reveal that GBA has comparable convergence properties with synchronous training, and demonstrate the robustness of GBA for recommendation models against gradient staleness. Experiments on three industrial-scale recommendation tasks show that GBA is an effective tuning-free approach for switching. Compared to state-of-the-art asynchronous training, GBA achieves up to 0.2% improvement on the AUC metric, which is significant for recommendation models. Meanwhile, under strained hardware resources, GBA speeds up by at least 2.4x compared to synchronous training.
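To make the core idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of a PS-side aggregator in the spirit of the abstract: worker gradients are accumulated until the same global batch size as synchronous training has been assembled, and gradients with severe staleness (large lag behind the global step) are decayed before aggregation. The class name, the staleness threshold, and the exponential decay rule `0.5 ** excess_staleness` are all assumptions for illustration; the paper's actual token-control process is more involved.

```python
class GlobalBatchAggregator:
    """Sketch of global-batch gradient aggregation with staleness decay."""

    def __init__(self, global_batch_size, staleness_threshold=2):
        self.global_batch_size = global_batch_size   # same as synchronous training
        self.staleness_threshold = staleness_threshold
        self.global_step = 0
        self._grad_sum = 0.0    # scalar stands in for a gradient tensor
        self._samples = 0

    def push(self, grad, batch_size, worker_step):
        """A worker pushes a gradient computed against model state at `worker_step`.

        Returns the averaged global-batch update once enough samples have
        been assembled, otherwise None.
        """
        staleness = self.global_step - worker_step
        weight = 1.0
        if staleness > self.staleness_threshold:
            # Decay severely stale gradients (assumed exponential rule).
            weight = 0.5 ** (staleness - self.staleness_threshold)
        self._grad_sum += weight * grad * batch_size
        self._samples += batch_size
        if self._samples >= self.global_batch_size:
            # A full global batch is assembled: emit the update and reset.
            update = self._grad_sum / self._samples
            self.global_step += 1
            self._grad_sum, self._samples = 0.0, 0
            return update
        return None
```

For example, with a global batch size of 1024 and four workers each pushing a batch of 256 non-stale gradients of value 1.0, the first three `push` calls return `None` and the fourth returns the averaged update `1.0`, after which the global step advances.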
