Paper Title

Scalable K-FAC Training for Deep Neural Networks with Distributed Preconditioning

Authors

Lin Zhang, Shaohuai Shi, Wei Wang, Bo Li

Abstract

Second-order optimization methods, notably the D-KFAC (Distributed Kronecker Factored Approximate Curvature) algorithms, have gained traction in accelerating deep neural network (DNN) training on GPU clusters. However, existing D-KFAC algorithms need to compute and communicate a large volume of second-order information, i.e., Kronecker factors (KFs), before preconditioning gradients, resulting in large computation and communication overheads as well as a high memory footprint. In this paper, we propose DP-KFAC, a novel distributed preconditioning scheme that distributes the KF constructing tasks at different DNN layers to different workers. DP-KFAC not only retains the convergence property of the existing D-KFAC algorithms but also enables three benefits: reduced computation overhead in constructing KFs, no communication of KFs, and low memory footprint. Extensive experiments on a 64-GPU cluster show that DP-KFAC reduces the computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update compared to the state-of-the-art D-KFAC methods.
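To make the distributed-preconditioning idea in the abstract concrete, below is a minimal, hypothetical sketch, not the authors' implementation. It assumes a round-robin assignment of layers to workers and uses the standard K-FAC factorization of a layer's Fisher block, F ≈ A ⊗ G with A = E[a aᵀ] and G = E[g gᵀ], so the preconditioned gradient is G⁻¹ (∇W) A⁻¹. Names such as `assign_layers`, `construct_kfs`, and `precondition` are illustrative assumptions.

```python
# Hypothetical sketch of distributed preconditioning: each worker constructs
# the Kronecker factors (KFs) only for the layers assigned to it, so KFs
# themselves never need to be communicated.
import numpy as np

def assign_layers(num_layers, num_workers):
    """Round-robin assignment of KF-construction tasks to workers (assumed policy)."""
    return {layer: layer % num_workers for layer in range(num_layers)}

def construct_kfs(activations, output_grads, damping=0.03):
    """Standard K-FAC factors A = E[a a^T], G = E[g g^T] from local batch statistics."""
    n = activations.shape[0]
    A = activations.T @ activations / n + damping * np.eye(activations.shape[1])
    G = output_grads.T @ output_grads / n + damping * np.eye(output_grads.shape[1])
    return A, G

def precondition(grad_W, A, G):
    """K-FAC preconditioning: (A (x) G)^{-1} vec(grad_W) = vec(G^{-1} grad_W A^{-1})."""
    return np.linalg.inv(G) @ grad_W @ np.linalg.inv(A)

# Example: worker `rank` preconditions only the layers it owns; the preconditioned
# gradients (not the KFs) would then be synchronized across workers (assumption).
num_layers, num_workers, rank = 6, 4, 1
owner = assign_layers(num_layers, num_workers)
rng = np.random.default_rng(0)
for layer in range(num_layers):
    if owner[layer] != rank:
        continue                        # another worker preconditions this layer
    a = rng.standard_normal((32, 16))   # layer inputs (batch x in_dim)
    g = rng.standard_normal((32, 8))    # back-propagated output grads (batch x out_dim)
    grad_W = g.T @ a                    # raw weight gradient (out_dim x in_dim)
    A, G = construct_kfs(a, g)
    update = precondition(grad_W, A, G)
```

Because each worker inverts KFs only for its assigned layers, the per-worker factorization and inversion cost shrinks roughly in proportion to the number of workers, which is the source of the reduced computation and memory overheads claimed in the abstract.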
