Paper Title

Mimose: An Input-Aware Checkpointing Planner for Efficient Training on GPU

Paper Authors

Jianjin Liao, Mingzhen Li, Qingxiao Sun, Jiwei Hao, Fengwei Yu, Shengdong Chen, Ye Tao, Zicheng Zhang, Hailong Yang, Zhongzhi Luan, Depei Qian

Paper Abstract

Larger deep learning models usually lead to higher model quality with an ever-increasing GPU memory footprint. Although tensor checkpointing techniques have been proposed to enable training under a restricted GPU memory budget, the input tensor dynamics remain unexploited for optimizing performance while reducing the GPU memory footprint. Specifically, due to the diverse datasets and subsequent data augmentation, the input tensor size per mini-batch is dynamic during the training process, leading to a changing GPU memory footprint. However, to leverage such input tensor dynamics in checkpointing, two challenges need to be solved. First, the checkpointing plan needs to be determined during runtime due to the dynamics of input tensors. Second, the checkpointing plan needs to be applied on the fly without significantly deteriorating the performance. In this paper, we propose Mimose, an input-aware tensor checkpointing planner that respects the memory budget while enabling efficient model training on GPU. Mimose builds a lightweight but accurate prediction model of GPU memory usage online, without pre-analyzing the model. It generates a tensor checkpointing plan based on per-layer memory prediction and applies it to the training process on the fly. It also adopts a caching strategy to avoid regenerating plans for repeated input sizes. Our experiments show that Mimose achieves superior training throughput compared to state-of-the-art memory planners under the same GPU memory budgets.
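
To make the described mechanism concrete, below is a minimal Python/PyTorch sketch of the planning loop the abstract outlines: an online per-layer memory predictor, a budget-driven plan generator, and a plan cache keyed by input size. This is not Mimose's actual implementation; the names `MemoryModel`, `plan_checkpoints`, and `CachedPlanner` are hypothetical, only `torch.utils.checkpoint.checkpoint` is a real PyTorch API, and the linear memory-scaling and greedy plan-selection heuristics are simplifying assumptions.

```python
# Hypothetical sketch of an input-aware checkpointing planner, not
# Mimose's real code. Only torch.utils.checkpoint.checkpoint is a
# genuine PyTorch API; everything else is an illustrative assumption.
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class MemoryModel:
    """Lightweight online predictor: assumes per-layer activation
    memory scales roughly linearly with the input size."""

    def __init__(self):
        self.samples = []  # (input_size, [per_layer_bytes]) observations

    def observe(self, input_size, per_layer_bytes):
        # Fed by measurements taken during early training iterations,
        # so no ahead-of-time model analysis is needed.
        self.samples.append((input_size, list(per_layer_bytes)))

    def predict(self, input_size, num_layers):
        if not self.samples:
            return [0.0] * num_layers  # no data yet: predict nothing
        # Scale the closest observed sample linearly to the new size.
        ref_size, ref_bytes = min(
            self.samples, key=lambda s: abs(s[0] - input_size))
        scale = input_size / max(ref_size, 1)
        return [b * scale for b in ref_bytes]


def plan_checkpoints(per_layer_bytes, budget_bytes):
    """Greedily mark the most memory-hungry layers for checkpointing
    until the predicted footprint fits the budget (a stand-in for the
    paper's actual plan-generation algorithm)."""
    total = sum(per_layer_bytes)
    order = sorted(range(len(per_layer_bytes)),
                   key=per_layer_bytes.__getitem__, reverse=True)
    plan = set()
    for i in order:
        if total <= budget_bytes:
            break
        total -= per_layer_bytes[i]  # checkpointing frees these activations
        plan.add(i)
    return plan


class CachedPlanner(nn.Module):
    """Applies a cached, input-size-keyed checkpointing plan on the fly."""

    def __init__(self, layers, budget_bytes):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.budget = budget_bytes
        self.mem_model = MemoryModel()
        self.plan_cache = {}  # input size -> set of layer indices

    def forward(self, x):
        key = x.shape[-1]  # the dynamic dimension, e.g. sequence length
        if key not in self.plan_cache:  # reuse plans for repeated sizes
            pred = self.mem_model.predict(key, len(self.layers))
            self.plan_cache[key] = plan_checkpoints(pred, self.budget)
        plan = self.plan_cache[key]
        out = x
        for i, layer in enumerate(self.layers):
            if i in plan:
                # Checkpointed: activations are recomputed in backward.
                out = checkpoint(layer, out, use_reentrant=False)
            else:
                out = layer(out)
        return out
```

In this reading of the abstract, the predictor is built entirely online: the hypothetical `observe` calls would be fed by measurements from instrumented early iterations rather than any pre-analysis of the model, and the cache ensures a plan is generated at most once per distinct input size.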
