Paper Title

DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation

Authors

Yu Tang, Chenyu Wang, Yufan Zhang, Yuliang Liu, Xingcheng Zhang, Linbo Qiao, Zhiquan Lai, Dongsheng Li

Abstract

The further development of deep neural networks is hampered by limited GPU memory, so optimizing GPU memory usage is in high demand. Swapping and recomputation are commonly applied to make better use of GPU memory in deep learning. However, as an emerging domain, several challenges remain: 1) the efficiency of recomputation is limited for both static and dynamic methods; 2) swapping requires offloading parameters manually, which incurs a great time cost; 3) there is currently no dynamic and fine-grained method that combines tensor swapping with tensor recomputation. To remedy these issues, we propose a novel scheduler manager named DELTA (Dynamic tEnsor offLoad and recompuTAtion). To the best of our knowledge, we are the first to build a reasonable dynamic runtime scheduler that combines tensor swapping and tensor recomputation without user oversight. In DELTA, we propose a filter algorithm to select the optimal tensors to be released from GPU memory and present a director algorithm to select a proper action for each of these tensors. Furthermore, prefetching and overlapping are deliberately applied to hide the time cost caused by swapping and recomputing tensors. Experimental results show that DELTA not only saves 40%-70% of GPU memory, surpassing the state-of-the-art method by a large margin, but also achieves convergence results comparable to the baseline with an acceptable time delay. Moreover, DELTA attains a 2.04$\times$ larger maximum batch size when training ResNet-50 and 2.25$\times$ when training ResNet-101 compared with the baseline. Finally, comparisons between the swapping cost and the recomputation cost in our experiments demonstrate the importance of a reasonable dynamic scheduler over tensor swapping and tensor recomputation, refuting the argument in some related work that swapping should always be the first and best choice.
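
To make the scheduling idea in the abstract concrete, below is a minimal, hypothetical sketch of the two-stage decision it describes: a "filter" step that picks candidate tensors to evict from GPU memory, and a "director" step that chooses swapping or recomputation per tensor by comparing estimated costs. The class names, fields, cost model, and selection heuristic are illustrative assumptions for exposition only, not DELTA's actual implementation.

```python
# Hypothetical sketch of a filter + director scheduler (not DELTA's real code).
from dataclasses import dataclass


@dataclass
class TensorInfo:
    name: str
    size_mb: float            # memory footprint on the GPU
    swap_cost_ms: float       # estimated transfer time (offload + prefetch back)
    recompute_cost_ms: float  # estimated time to re-run the producing ops


def filter_candidates(tensors, memory_to_free_mb):
    """Pick the largest tensors until the requested amount of memory is covered."""
    freed, chosen = 0.0, []
    for t in sorted(tensors, key=lambda t: t.size_mb, reverse=True):
        if freed >= memory_to_free_mb:
            break
        chosen.append(t)
        freed += t.size_mb
    return chosen


def director(tensor):
    """Choose the cheaper action for a single tensor: swap or recompute."""
    return "swap" if tensor.swap_cost_ms < tensor.recompute_cost_ms else "recompute"


if __name__ == "__main__":
    pool = [
        TensorInfo("conv1.out", 512.0, swap_cost_ms=8.0, recompute_cost_ms=3.0),
        TensorInfo("fc.out", 64.0, swap_cost_ms=1.0, recompute_cost_ms=6.0),
        TensorInfo("conv2.out", 256.0, swap_cost_ms=4.0, recompute_cost_ms=5.0),
    ]
    for t in filter_candidates(pool, memory_to_free_mb=700.0):
        print(t.name, "->", director(t))
```

In this toy example, large activations that are cheap to recompute are recomputed, while tensors whose transfer time is lower than their recomputation time are swapped out; in the paper, the per-tensor action is further overlapped with computation via prefetching to hide the cost.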
