Paper title
Benchmarking network fabrics for data distributed training of deep neural networks
Paper authors
Paper abstract
Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, in which the training data is distributed across multiple compute nodes. This approach is simple to implement and is supported by most commonly used machine learning frameworks. The data parallel approach leverages MPI to communicate gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives on data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC systems has no significant effect on the training times for commonly used deep neural network architectures or for traditional HPC applications such as Computational Fluid Dynamics.
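To make the gradient exchange described in the abstract concrete, the following is a minimal sketch of data-parallel gradient averaging via an MPI allreduce, assuming mpi4py and NumPy are available; the gradient size, the random stand-in for a backward pass, and the script name are illustrative assumptions, not the paper's benchmark code.

```python
# Illustrative data-parallel gradient averaging sketch (not the paper's
# benchmark code). Assumes mpi4py and NumPy are installed and the script is
# launched with e.g. `mpirun -n 4 python allreduce_sketch.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()

# Each rank would compute gradients on its own shard of the training data;
# a random vector stands in for a real backward pass here.
local_grad = np.random.rand(1024)

# Allreduce sums the local gradients across all ranks; dividing by the rank
# count yields the averaged gradient used for the synchronized weight update.
# GPU-aware libraries such as NCCL expose the same collective for device memory.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= world_size

if rank == 0:
    print(f"Averaged gradient over {world_size} ranks, norm={np.linalg.norm(global_grad):.4f}")
```

The interconnect (Ethernet vs. OmniPath) and the collective implementation (MPI vs. NCCL, with or without GPUDirect) determine how fast this allreduce step completes, which is the communication cost the paper benchmarks.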