HW-Aware Initialization of DNN Auto-Tuning to Improve Exploration Time and Robustness

Paper Title

HW-Aware Initialization of DNN Auto-Tuning to Improve Exploration Time and Robustness

Authors

Rieber, Dennis; Reiber, Moritz; Bringmann, Oliver; Fröning, Holger

Abstract

The process of optimizing the latency of DNN operators with ML models and hardware-in-the-loop, called auto-tuning, has established itself as a pervasive method for the deployment of neural networks. From a search space of loop optimizations, the candidate providing the best performance has to be selected. The performance of individual configurations is evaluated through hardware measurements. The combinatorial explosion of possible configurations, together with the cost of hardware evaluation, makes exhaustive exploration of the search space infeasible in practice. Machine learning methods, like random forests or reinforcement learning, are used to aid in the selection of candidates for hardware evaluation. For general-purpose hardware like x86 and GPGPU architectures, impressive performance gains can be achieved compared to hand-optimized libraries like cuDNN. The method is also useful in the space of hardware accelerators with less widespread adoption, where a high-performance library is not always available. However, hardware accelerators are often less flexible with respect to their programming, which leads to operator configurations that are not executable on the hardware target. This work evaluates how these invalid configurations affect the auto-tuning process and its underlying performance prediction model for the VTA hardware. From these results, a validity-driven initialization method for AutoTVM is developed, requiring only 41.6% of the necessary hardware measurements to find the best solution, while improving search robustness.
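The idea of seeding a tuner with configurations that can actually run on the target can be illustrated with a toy sketch. This is not the paper's method and does not use the AutoTVM API: the search space, the `is_valid` rule (a made-up buffer-size limit), and the `measure` cost function are all hypothetical stand-ins chosen for illustration. It only shows why an initial sample drawn from valid configurations may waste fewer hardware measurements than a uniformly random one.

```python
import random

# Hypothetical search space: (tile_h, tile_w) loop-tiling factors.
SPACE = [(h, w) for h in (1, 2, 4, 8, 16, 32) for w in (1, 2, 4, 8, 16, 32)]


def is_valid(cfg):
    """Assumed hardware rule: a tile must fit a 64-element on-chip buffer."""
    tile_h, tile_w = cfg
    return tile_h * tile_w <= 64


def measure(cfg):
    """Stand-in for a hardware measurement.

    Returns a made-up latency, or None when the configuration
    cannot be executed on the (imaginary) accelerator.
    """
    if not is_valid(cfg):
        return None
    tile_h, tile_w = cfg
    return 1000.0 / (tile_h * tile_w) + abs(tile_h - tile_w)


def initial_samples(space, n, rng, validity_driven):
    """Draw n starting configurations, optionally only from valid ones."""
    pool = [c for c in space if is_valid(c)] if validity_driven else list(space)
    return rng.sample(pool, n)


rng = random.Random(0)
random_init = initial_samples(SPACE, 8, rng, validity_driven=False)
valid_init = initial_samples(SPACE, 8, rng, validity_driven=True)

# Measurements that return None consumed a hardware slot without
# producing a latency the performance model could learn from.
wasted_random = sum(measure(c) is None for c in random_init)
wasted_valid = sum(measure(c) is None for c in valid_init)

print("wasted measurements (random init):", wasted_random)
print("wasted measurements (validity-driven init):", wasted_valid)
```

In a real accelerator flow the validity of a configuration is generally not known from a simple closed-form rule like this one; the sketch assumes it is, purely to keep the example self-contained.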
