Paper Title
A Mini-Block Fisher Method for Deep Neural Networks
Paper Authors
Paper Abstract
Deep neural networks (DNNs) are currently predominantly trained using first-order methods. Some of these methods (e.g., Adam, AdaGrad, and RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal matrix to precondition the stochastic gradient. Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and TNT, have been developed for training DNNs by preconditioning the stochastic gradient with layer-wise block-diagonal matrices. Here we propose a "mini-block Fisher (MBF)" preconditioned gradient method that lies between these two classes of methods. Specifically, our method uses a block-diagonal approximation to the empirical Fisher matrix, where for each layer in the DNN, whether it is convolutional or feed-forward and fully connected, the associated diagonal block is itself block-diagonal and is composed of a large number of mini-blocks of modest size. Our novel approach utilizes the parallelism of GPUs to efficiently perform computations on the large number of matrices in each layer. Consequently, MBF's per-iteration computational cost is only slightly higher than that of first-order methods. The performance of our proposed method is compared to that of several baseline methods, on both autoencoder and CNN problems, to validate its effectiveness in terms of both time efficiency and generalization power. Finally, we prove that an idealized version of MBF converges linearly.
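To illustrate the mini-block preconditioning idea described above, the sketch below applies a mini-block empirical Fisher to the gradient of a single fully connected layer. This is a minimal NumPy illustration under assumptions of our own, not the authors' implementation: the choice of one mini-block per output neuron (one block per row of the weight-gradient matrix), the helper name mbf_precondition, and the damping constant are all hypothetical.

```python
# Minimal sketch of a mini-block Fisher (MBF)-style preconditioner for one
# fully connected layer, under the assumption of one mini-block per output
# neuron. Illustrative only; block partitioning, damping, and averaging
# choices here are assumptions, not the paper's exact algorithm.
import numpy as np

def mbf_precondition(per_example_grads, damping=1e-2):
    """Precondition a layer's mean gradient with per-row mini-block Fisher blocks.

    per_example_grads : array of shape (N, out_dim, in_dim)
        Per-example gradients of the layer's weight matrix over a mini-batch.
    Returns the preconditioned mean gradient, shape (out_dim, in_dim).
    """
    N, out_dim, in_dim = per_example_grads.shape
    mean_grad = per_example_grads.mean(axis=0)                # (out_dim, in_dim)

    # Mini-block empirical Fisher: one (in_dim x in_dim) block per output
    # neuron, F_b = (1/N) * sum_i g_{i,b} g_{i,b}^T, built with one batched einsum.
    fisher_blocks = np.einsum('nbi,nbj->bij',
                              per_example_grads,
                              per_example_grads) / N          # (out_dim, in_dim, in_dim)

    # Damped inversion of all mini-blocks at once (batched over the leading
    # axis), mimicking how many small matrices can be processed in parallel on a GPU.
    eye = np.eye(in_dim)
    inv_blocks = np.linalg.inv(fisher_blocks + damping * eye)

    # Apply each block's inverse to the corresponding row of the mean gradient.
    return np.einsum('bij,bj->bi', inv_blocks, mean_grad)

# Tiny usage example with random per-example gradients.
rng = np.random.default_rng(0)
g = rng.normal(size=(32, 4, 10))    # batch of 32, layer weight of shape 4 x 10
update = mbf_precondition(g)
print(update.shape)                 # (4, 10)
```

Because all mini-blocks within a layer share the same small size, they can be stacked and inverted in a single batched operation, which is what keeps the per-iteration cost close to that of first-order methods.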