Paper Title
BackLink: Supervised Local Training with Backward Links
Authors
Abstract
Empowered by the backpropagation (BP) algorithm, deep neural networks have dominated the race in solving various cognitive tasks. The rigid training pattern of standard BP requires end-to-end error propagation, which incurs a large memory cost and prohibits model parallelization. Existing local training methods aim to remove this obstacle by completely cutting off the backward path between modules and isolating their gradients, thereby reducing memory cost and accelerating training. However, because these methods prevent errors from flowing between modules, they also block inter-module information exchange and yield inferior performance. This work proposes a novel local training algorithm, BackLink, which introduces inter-module backward dependency and allows errors to flow between modules, so that information can propagate backward along the network. To preserve the computational advantage of local training, BackLink restricts the error propagation length within each module. Extensive experiments on various deep convolutional neural networks demonstrate that our method consistently improves the classification performance of local training algorithms over other methods. For example, on ResNet32 with 16 local modules, our method surpasses the conventional greedy local training method by 4.00\% and a recent work by 1.83\% in accuracy on CIFAR10. An analysis of computational costs shows that only small overheads are incurred in GPU memory and runtime on multiple GPUs. Compared with standard BP, our method achieves up to a 79\% reduction in memory cost and a 52\% reduction in simulation runtime on ResNet110. Therefore, our method could create new opportunities for improving training algorithms towards better efficiency and biological plausibility.
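To make the idea concrete, the following is a minimal PyTorch-style sketch of a local module with a limited backward link, written under our own assumptions: the names (`BackLinkModule`, `early`/`late`, `aux_head`, `train_step`), the early/late split of each module, and the two-module toy network are illustrative and are not taken from the paper or any released code. Each module trains against its own auxiliary loss; the output handed to the next module is recomputed from detached early features, so the downstream error can reach only the last layers of the upstream module before being cut off.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BackLinkModule(nn.Module):
    """One local module with an auxiliary classifier and a limited backward link.

    All identifiers here are illustrative assumptions, not names from the paper.
    """

    def __init__(self, in_ch, out_ch, num_classes):
        super().__init__()
        # Early layers: receive gradients only from this module's local loss.
        self.early = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )
        # Late layers: additionally receive the error of the next module,
        # which is the role the backward link plays in this sketch.
        self.late = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )
        # Local (auxiliary) classifier that defines the module-level loss.
        self.aux_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(out_ch, num_classes),
        )

    def forward(self, x):
        h_early = self.early(x)
        logits = self.aux_head(self.late(h_early))  # path used by the local loss
        # Recompute the late layers on detached early features: the next module's
        # loss can then propagate into self.late but is cut before self.early,
        # i.e. the error propagation length inside this module stays short.
        h_out = self.late(h_early.detach())
        return h_out, logits


def train_step(modules, optimizer, x, y):
    """One training step: every module has a local loss, linked across boundaries."""
    h, losses = x, []
    for module in modules:
        h, logits = module(h)  # h carries a short backward path into this module
        losses.append(F.cross_entropy(logits, y))
    optimizer.zero_grad()
    # Summing the losses keeps the sketch short; a practical local-training loop
    # would backpropagate each local loss as soon as it is computed so that
    # per-module activation memory can be released early.
    sum(losses).backward()
    optimizer.step()
    return [loss.item() for loss in losses]


if __name__ == "__main__":
    torch.manual_seed(0)
    modules = nn.ModuleList([BackLinkModule(3, 16, 10), BackLinkModule(16, 32, 10)])
    optimizer = torch.optim.SGD(modules.parameters(), lr=0.1)
    x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    print(train_step(modules, optimizer, x, y))
```

Splitting each module at a fixed boundary is only one way to realize a restricted backward path; the paper may place or parameterize the link differently (for example, by how many layers the downstream error is allowed to travel), so this sketch should be read as an illustration of the gradient-flow structure rather than as the authors' implementation.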