Paper Title

Scaling Distributed Training with Adaptive Summation

Paper Authors

Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi, Tianju Xu, Vadim Eksarevskiy, Jaliya Ekanayake, Emad Barsoum

Paper Abstract

Stochastic gradient descent (SGD) is an inherently sequential training algorithm: computing the gradient at batch $i$ depends on the model parameters learned from batch $i-1$. Prior approaches that break this dependence do not honor it (e.g., summing the gradients for each batch, which is not what sequential SGD would do) and thus potentially suffer from poor convergence. This paper introduces a novel method to combine gradients, called Adasum (for adaptive sum), that converges faster than prior work. Adasum is easy to implement, almost as efficient as simply summing gradients, and is integrated into the open-source toolkit Horovod. The paper first provides a formal justification for Adasum and then empirically demonstrates that Adasum is more accurate than prior gradient accumulation methods. It then introduces a series of case studies showing that Adasum works with multiple frameworks (TensorFlow and PyTorch) and scales multiple optimizers (Momentum-SGD, Adam, and LAMB) to larger batch sizes while still giving good downstream accuracy. Finally, it proves that Adasum converges. To summarize, Adasum scales Momentum-SGD on the MLPerf Resnet50 benchmark to 64K examples before communication (no MLPerf v0.5 entry converged with more than 16K), the Adam optimizer to 64K examples before communication on BERT-LARGE (prior work showed Adam stopped scaling at 16K), and the LAMB optimizer to 128K before communication on BERT-LARGE (prior work used 64K), all while maintaining downstream accuracy metrics. Finally, if a user does not need to scale, we show that LAMB with Adasum on BERT-LARGE converges in 30% fewer steps than the baseline.
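For intuition, below is a minimal NumPy sketch (not the paper's implementation) of a pairwise adaptive gradient combination in the spirit of Adasum: gradients that point in similar directions are partially averaged rather than summed, while nearly orthogonal gradients are added almost unchanged. The function names `adasum_pair` and `adasum_tree` and the exact scaling rule here are illustrative assumptions based on the published description; consult the paper and the Horovod source for the authoritative operator.

```python
# Minimal sketch of an Adasum-style adaptive gradient combination (illustrative only).
import numpy as np

def adasum_pair(g1: np.ndarray, g2: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Combine two gradients, shrinking each one's contribution by half of its
    projection onto the other: parallel gradients end up averaged, while
    orthogonal gradients are simply summed."""
    dot = float(np.dot(g1, g2))
    scale1 = 1.0 - dot / (2.0 * (float(np.dot(g1, g1)) + eps))
    scale2 = 1.0 - dot / (2.0 * (float(np.dot(g2, g2)) + eps))
    return scale1 * g1 + scale2 * g2

def adasum_tree(grads):
    """Reduce a list of per-worker gradients pairwise, mimicking the tree-style
    reduction a distributed allreduce would perform."""
    if len(grads) == 1:
        return grads[0]
    mid = len(grads) // 2
    return adasum_pair(adasum_tree(grads[:mid]), adasum_tree(grads[mid:]))

# Orthogonal gradients add up; identical gradients are averaged.
print(adasum_tree([np.array([1.0, 0.0]), np.array([0.0, 1.0])]))  # ~[1. 1.]
print(adasum_tree([np.array([1.0, 0.0]), np.array([1.0, 0.0])]))  # ~[1. 0.]
```

In Horovod itself, Adasum is exposed as a reduction option (e.g., passing `op=hvd.Adasum` to `hvd.DistributedOptimizer`); see the Horovod documentation for the exact, current API.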
