Paper Title

Analyzing and Mitigating Data Stalls in DNN Training

Paper Authors

Jayashree Mohan, Amar Phanishayee, Ashish Raniwala, Vijay Chidambaram

Paper Abstract

Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time, the impact of the input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely-used computer vision and audio Deep Neural Networks (DNNs) that typically involve complex data preprocessing. We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, and GPU generation, on servers that are part of a large production cluster at Microsoft. We find that in many cases, DNN training time is dominated by data stall time: time spent waiting for data to be fetched and preprocessed. (2) We build a tool, DS-Analyzer, to precisely measure data stalls using a differential technique, and to perform predictive what-if analysis on data stalls. (3) Finally, based on the insights from our analysis, we design and implement three simple but effective techniques in a data-loading library, CoorDL, to mitigate data stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware configurations show that when PyTorch uses CoorDL instead of the state-of-the-art DALI data loading library, DNN training time is reduced significantly (by as much as 5x on a single server).
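The "data stall time" described in the abstract can be illustrated with a rough timing harness around a standard PyTorch training loop: measure the fraction of each epoch the host spends waiting on the DataLoader (fetch plus preprocessing) versus running the forward/backward pass on the GPU. The sketch below only illustrates that idea; it is not the paper's DS-Analyzer (which uses a differential technique) or CoorDL, and the ResNet-18 model, synthetic FakeData dataset, and hyperparameters are placeholder assumptions.

```python
# Minimal sketch of measuring data stall time in a PyTorch loop.
# NOT the authors' DS-Analyzer/CoorDL; model, dataset, and settings are placeholders.
import time

import torch
import torchvision
from torchvision import transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.resnet18(num_classes=10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

# Synthetic dataset so the sketch is self-contained.
dataset = torchvision.datasets.FakeData(
    size=2048, image_size=(3, 224, 224), num_classes=10,
    transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4)

data_time = 0.0     # host time spent waiting for fetched/preprocessed batches
compute_time = 0.0  # time spent in the forward/backward/update step

prev = time.perf_counter()
for images, labels in loader:
    fetched = time.perf_counter()
    data_time += fetched - prev  # stall: waiting on the input pipeline

    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    if device.type == "cuda":
        torch.cuda.synchronize()  # make GPU work visible to the host timer
    prev = time.perf_counter()
    compute_time += prev - fetched

total = data_time + compute_time
print(f"data stall: {data_time:.1f}s ({100 * data_time / total:.0f}% of epoch), "
      f"compute: {compute_time:.1f}s")
```

If the reported stall fraction is large, the epoch is input-bound, which is the situation the paper's prefetching, caching, and coordinated data-loading techniques target.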
