Paper Title
ReZero is All You Need: Fast Convergence at Large Depth
Paper Authors
Paper Abstract
Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al. used free probability theory to show that dynamical isometry plays an integral role in efficient deep learning. We show that the simplest architecture change of gating each residual connection using a single zero-initialized parameter satisfies initial dynamical isometry and outperforms more complex approaches. Although much simpler than its predecessors, this gate enables training thousands of fully connected layers with fast convergence and better test performance for ResNets trained on CIFAR-10. We apply this technique to language modeling and find that we can easily train 120-layer Transformers. When applied to 12-layer Transformers, it converges 56% faster on enwiki8.
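The gating idea the abstract describes, a residual update of the form x + alpha * F(x) with alpha initialized to zero so each block starts as the identity, can be sketched in a few lines. The following is a minimal illustration assuming PyTorch; the module name ReZeroBlock and the inner two-layer MLP are assumptions for the example, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class ReZeroBlock(nn.Module):
    """Residual block gated by a single zero-initialized scalar.

    Computes x + alpha * F(x). Because alpha starts at 0, the block is the
    identity map at initialization, which gives the initial dynamical
    isometry discussed in the abstract.
    """

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        # F(x): any residual sub-network; a small MLP is used here purely
        # for illustration (it could be a Transformer sub-layer instead).
        self.transform = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, dim),
        )
        # The single trainable gate per residual connection, initialized to zero.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.transform(x)


# Usage sketch: stacking many such blocks remains trainable because the
# network is the identity at initialization and each gate opens gradually.
model = nn.Sequential(*[ReZeroBlock(dim=128, hidden_dim=512) for _ in range(100)])
out = model(torch.randn(8, 128))
```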