Paper Title
Randomized Automatic Differentiation
Paper Authors
Paper Abstract
The successes of deep learning, variational inference, and many other fields have been aided by specialized implementations of reverse-mode automatic differentiation (AD) to compute gradients of mega-dimensional objectives. The AD techniques underlying these tools were designed to compute exact gradients to numerical precision, but modern machine learning models are almost always trained with stochastic gradient descent. Why spend computation and memory on exact (minibatch) gradients only to use them for stochastic optimization? We develop a general framework and approach for randomized automatic differentiation (RAD), which can allow unbiased gradient estimates to be computed with reduced memory in return for variance. We examine limitations of the general approach, and argue that we must leverage problem-specific structure to realize benefits. We develop RAD techniques for a variety of simple neural network architectures, and show that for a fixed memory budget, RAD converges in fewer iterations than using a small batch size for feedforward networks, and in a similar number for recurrent networks. We also show that RAD can be applied to scientific computing, and use it to develop a low-memory stochastic gradient method for optimizing the control parameters of a linear reaction-diffusion PDE representing a fission reactor.
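To make the memory-for-variance trade concrete, below is a minimal toy sketch of the idea described in the abstract (our own illustration, not the paper's RAD algorithm): the forward pass of a single ReLU layer is computed exactly, but only a randomly sparsified and rescaled copy of the input activation is stored for the backward pass, so the resulting gradient is an unbiased but noisier estimate of the true gradient. The function and parameter names (`forward_backward`, `keep_prob`) are hypothetical.

```python
# Toy sketch of trading activation memory for gradient variance.
# Forward pass is exact; only the activation stored for backprop is sparsified.
import numpy as np

rng = np.random.default_rng(0)

def forward_backward(W, x, keep_prob=0.25):
    """One layer y = relu(W @ x), loss L = sum(y).
    Returns the loss and an unbiased randomized estimate of dL/dW."""
    pre = W @ x                       # exact pre-activation
    y = np.maximum(pre, 0.0)          # exact output
    # Keep each entry of x with probability keep_prob, rescale by 1/keep_prob
    # so the stored copy has the right expectation (could be stored sparsely).
    mask = rng.random(x.shape) < keep_prob
    x_stored = np.where(mask, x / keep_prob, 0.0)
    # Backward pass: dL/dW_ij = relu'(pre_i) * x_j, using the sparsified copy.
    delta = (pre > 0).astype(float)
    grad_W = np.outer(delta, x_stored)
    return y.sum(), grad_W

W = rng.normal(size=(3, 8))
x = rng.normal(size=8)

# Exact gradient for comparison.
exact_grad = np.outer(((W @ x) > 0).astype(float), x)

# Averaging many randomized gradients recovers the exact gradient (unbiasedness).
avg_grad = np.mean([forward_backward(W, x)[1] for _ in range(20000)], axis=0)
print(np.max(np.abs(avg_grad - exact_grad)))  # small; shrinks as more samples are averaged
```

The key point of the sketch is that the randomness enters only in what is stored for the backward pass, not in the forward computation, so the estimator stays unbiased while the per-step memory footprint shrinks with `keep_prob`.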