Paper Title

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Authors

Yongchao Liu, Yue Jin, Yong Chen, Teng Teng, Hang Ou, Rui Zhao, Yao Zhang

Abstract

Accelerating deep model training and inference is crucial in practice. Existing deep learning frameworks usually concentrate on optimizing training speed and pay less attention to inference-specific optimizations. In fact, model inference differs from training in terms of computation, e.g., parameters are refreshed at each gradient update step during training but kept invariant during inference. These special characteristics of model inference open new opportunities for its optimization. In this paper, we propose a hardware-aware optimization framework, namely Woodpecker-DL (WPK), to accelerate inference by taking advantage of multiple joint optimizations from the perspectives of graph optimization, automated search, domain-specific language (DSL) compiler techniques, and system-level exploration. In WPK, we investigate two new automated search approaches, based on a genetic algorithm and reinforcement learning respectively, to hunt for the best operator code configurations targeting specific hardware. A customized DSL compiler is further attached to these search algorithms to generate efficient code. To create an optimized inference plan, WPK systematically explores high-speed operator implementations from third-party libraries in addition to our automatically generated code, and singles out the best implementation per operator for use. Extensive experiments demonstrate that on a Tesla P100 GPU, we achieve maximum speedups of 5.40× over cuDNN and 1.63× over TVM on individual convolution operators, and run up to 1.18× faster than TensorRT for end-to-end model inference.
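The abstract describes automated search (genetic algorithm and reinforcement learning) over operator code configurations, with each candidate compiled by a DSL compiler and timed on the target hardware. The sketch below is a minimal, hypothetical illustration of the genetic-algorithm side only, assuming a configuration is a vector of tile/unroll factors and using a synthetic `measure_latency` stand-in for compile-and-time on the GPU; it is not the paper's actual implementation.

```python
# Hypothetical sketch of GA-based operator tuning (not WPK's real code).
# A "configuration" is a vector of tile/unroll factors; measure_latency is a
# stand-in for compiling the config with the DSL compiler and timing it on GPU.
import random

CANDIDATE_TILES = [1, 2, 4, 8, 16, 32, 64]   # per-dimension search space (assumed)
CONFIG_DIMS = 4                              # e.g. tile_x, tile_y, tile_k, unroll

def measure_latency(config):
    """Synthetic cost model used only so the sketch runs standalone."""
    return sum(abs(c - 16) for c in config) + random.random()

def random_config():
    return [random.choice(CANDIDATE_TILES) for _ in range(CONFIG_DIMS)]

def mutate(config, rate=0.2):
    # Randomly perturb some dimensions of a configuration.
    return [random.choice(CANDIDATE_TILES) if random.random() < rate else c
            for c in config]

def crossover(a, b):
    # Single-point crossover between two parent configurations.
    cut = random.randrange(1, CONFIG_DIMS)
    return a[:cut] + b[cut:]

def ga_search(pop_size=32, generations=20, elite=4):
    population = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=measure_latency)
        parents = scored[:elite]             # keep the fastest configs
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - elite)]
        population = parents + children
    return min(population, key=measure_latency)

if __name__ == "__main__":
    best = ga_search()
    print("best config:", best, "estimated latency:", measure_latency(best))
```

In the same spirit, the per-operator selection step described in the abstract can be thought of as timing the tuned kernel against third-party library implementations (e.g., cuDNN) for each operator and keeping whichever is fastest when building the inference plan.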
