Paper Title
Large Batch Training Does Not Need Warmup
Paper Authors
Paper Abstract
Training deep neural networks with a large batch size has shown promising results and benefits many real-world applications. However, the optimizer converges slowly in the early epochs, and there is a gap between large-batch deep learning optimization heuristics and their theoretical underpinnings. In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training. We also analyze the convergence rate of the proposed method by introducing a new fine-grained analysis of gradient-based methods. Based on our analysis, we bridge the gap and provide theoretical insights into three popular large-batch training techniques: linear learning rate scaling, gradual warmup, and layer-wise adaptive rate scaling. Extensive experiments demonstrate that the proposed algorithm outperforms the gradual warmup technique by a large margin and converges faster than the state-of-the-art large-batch optimizer when training advanced deep neural networks (ResNet, DenseNet, MobileNet) on the ImageNet dataset.
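To make the three large-batch heuristics mentioned in the abstract concrete, below is a minimal Python/PyTorch sketch of linear learning rate scaling, gradual warmup, and a LARS-style layer-wise adaptive rate (trust-ratio) rule. This is not the paper's CLARS algorithm; the function names, base batch size of 256, warmup length, and epsilon value are illustrative assumptions.

```python
import torch

# Illustrative sketch (not the paper's CLARS algorithm): the three large-batch
# heuristics referenced in the abstract, written as plain helper functions.
# All names and hyperparameter values here are assumptions for illustration.

def linear_scaled_lr(base_lr, batch_size, base_batch_size=256):
    """Linear learning rate scaling: the learning rate grows in proportion to batch size."""
    return base_lr * batch_size / base_batch_size

def warmup_lr(target_lr, step, warmup_steps):
    """Gradual warmup: ramp the learning rate linearly over the first warmup_steps updates."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

def layerwise_adaptive_lr(param, grad, global_lr, eps=1e-8):
    """LARS-style layer-wise adaptive rate scaling:
    scale each layer's step by the trust ratio ||w|| / ||g||."""
    w_norm = param.norm()
    g_norm = grad.norm()
    if w_norm > 0 and g_norm > 0:
        trust_ratio = w_norm / (g_norm + eps)
    else:
        trust_ratio = 1.0
    return global_lr * trust_ratio

# Example: one SGD-style update for a single layer's weight tensor.
w = torch.randn(128, 64, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
lr = warmup_lr(linear_scaled_lr(0.1, batch_size=8192), step=0, warmup_steps=500)
with torch.no_grad():
    w -= layerwise_adaptive_lr(w, w.grad, lr) * w.grad
```

In this sketch the global learning rate is first scaled linearly with the batch size, then ramped up by warmup, and finally modulated per layer by the trust ratio; the abstract's claim is that, with the right layer-wise analysis, the warmup stage becomes unnecessary.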