Paper Title
Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training
Paper Authors
Paper Abstract
We study two factors in neural network training: data parallelism and sparsity. Here, data parallelism means processing training data in parallel using distributed systems (or, equivalently, increasing the batch size) so that training can be accelerated; by sparsity, we refer to pruning parameters in a neural network model so as to reduce computational and memory costs. Despite their promising benefits, however, the understanding of their effects on neural network training remains elusive. In this work, we first measure these effects rigorously by conducting extensive experiments while tuning all metaparameters involved in the optimization. As a result, we find, across various workloads of dataset, network model, and optimization algorithm, that there exists a general scaling trend between batch size and the number of training steps to convergence for the effect of data parallelism, and, further, a difficulty of training under sparsity. Then, we develop a theoretical analysis based on the convergence properties of stochastic gradient methods and the smoothness of the optimization landscape, which illustrates the observed phenomena precisely and generally, establishing a better account of the effects of data parallelism and sparsity on neural network training.
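
To make the batch-size/steps trade-off referred to above concrete, the following is a minimal sketch based on a standard smoothness-based bound for stochastic gradient descent; it is a generic illustration and not necessarily the exact analysis developed in the paper, and the symbols $f$, $f^\ast$, $w_t$, $\eta$, $L$, $\sigma^2$, $B$, and $T$ are notation introduced here. For an $L$-smooth objective $f$ with infimum $f^\ast$, unbiased stochastic gradients with per-example variance at most $\sigma^2$, mini-batch size $B$, and step size $\eta \le 1/L$, one has, up to constants,

\[
\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\!\left[\|\nabla f(w_t)\|^2\right]
\;\le\;
\frac{2\big(f(w_0) - f^\ast\big)}{\eta T}
\;+\;
\frac{\eta L \sigma^2}{B}.
\]

Under this bound, reaching a fixed gradient-norm tolerance requires fewer steps $T$ as $B$ grows, but only until the first (optimization) term dominates the second (noise) term, after which larger batches yield diminishing returns; this is the kind of scaling relationship between batch size and steps to convergence that the abstract describes. The abstract's mention of landscape smoothness also suggests reading the difficulty of training under sparsity through how pruning changes the effective smoothness constant $L$, though the precise argument is developed in the paper itself.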