Paper Title

Deep Learning Models on CPUs: A Methodology for Efficient Training

Paper Authors

Quchen Fu, Ramesh Chukka, Keith Achorn, Thomas Atta-fosu, Deepak R. Canchi, Zhongwei Teng, Jules White, Douglas C. Schmidt

Paper Abstract

GPUs have been favored for training deep learning models due to their highly parallel architecture. As a result, most studies on training optimization focus on GPUs. However, there is often a trade-off between cost and efficiency when choosing the proper hardware for training. In particular, CPU servers can be beneficial if training on CPUs is made more efficient, since they incur fewer hardware update costs and better utilize existing infrastructure. This paper makes several contributions to research on training deep learning models using CPUs. First, it presents a method for optimizing the training of deep learning models on Intel CPUs, along with a toolkit called ProfileDNN, which we developed to improve performance profiling. Second, we describe a generic training optimization method that guides our workflow, and we explore several case studies in which we identified performance issues and then optimized the Intel Extension for PyTorch, resulting in an overall 2x training performance increase for the RetinaNet-ResNext50 model. Third, we show how we leveraged the visualization capabilities of ProfileDNN to pinpoint bottlenecks and create a custom focal loss kernel that was two times faster than the official reference PyTorch implementation.
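For readers unfamiliar with the Intel Extension for PyTorch mentioned in the second contribution: it exposes an `ipex.optimize()` entry point that rewrites a model and optimizer with CPU-specific optimizations (memory-format conversion, operator fusion, optional bf16 weights). The abstract does not include code; the sketch below is only a minimal illustration of how the extension is typically applied to a CPU training step. ResNet-50, the SGD settings, and the bf16 choice are stand-ins for illustration, not the paper's configuration.

```python
import torch
import torchvision
import intel_extension_for_pytorch as ipex  # pip install intel-extension-for-pytorch

# Any CPU-trainable model works here; ResNet-50 is just a stand-in.
model = torchvision.models.resnet50()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model.train()

# Rewrite model + optimizer with CPU-specific optimizations;
# dtype=torch.bfloat16 enables bf16 weight handling on supported CPUs.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

# One training step under CPU bf16 autocast.
x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, 1000, (8,))
with torch.autocast("cpu", dtype=torch.bfloat16):
    loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```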
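For context on the third contribution, the "official reference PyTorch implementation" of focal loss is presumably `torchvision.ops.sigmoid_focal_loss`, which computes the standard formulation from Lin et al. (2017): FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The unoptimized sketch below shows what any custom kernel must reproduce numerically; it is not the paper's optimized kernel.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Sigmoid focal loss: `logits` are raw per-class scores,
    `targets` are 0/1 floats of the same shape."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)   # probability of the true class
    loss = ce * (1 - p_t) ** gamma                # down-weight easy examples
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * loss).mean()

# Example: 4 anchors, 3 classes, one-hot targets
logits = torch.randn(4, 3)
targets = torch.zeros(4, 3)
targets[torch.arange(4), torch.randint(0, 3, (4,))] = 1.0
print(focal_loss(logits, targets))
```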
