Title
Inference Latency Prediction at the Edge
Authors
Abstract
With the growing workload of inference tasks on mobile devices, state-of-the-art neural architectures (NAs) are typically designed through Neural Architecture Search (NAS) to identify NAs with good tradeoffs between accuracy and efficiency (e.g., latency). Since measuring the latency of a huge set of candidate architectures during NAS is not scalable, approaches are needed for predicting end-to-end inference latency on mobile devices. Such predictions are challenging due to hardware heterogeneity, optimizations applied by ML frameworks, and the diversity of neural architectures. Motivated by these challenges, in this paper, we first quantitatively assess characteristics of neural architectures and mobile devices that have significant effects on inference latency. Based on this assessment, we propose a latency prediction framework that addresses these challenges by developing operation-wise latency predictors under a variety of settings and on a number of hardware devices with multi-core CPUs and GPUs, achieving high accuracy in end-to-end latency prediction, as shown by our comprehensive evaluations. To illustrate that our approach does not require expensive data collection, we also show that accurate predictions can be achieved on real-world NAs using only small amounts of profiling data.
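The core idea of operation-wise latency prediction can be illustrated with a minimal sketch: fit one latency predictor per operation type, then estimate end-to-end latency as the sum of per-operation predictions. All names, features, and coefficients below are hypothetical placeholders, not the paper's actual predictors or data.

```python
# Hypothetical sketch of operation-wise latency prediction.
# End-to-end latency ~= sum of per-operation latency predictions.
# Coefficients and feature choices here are made up for illustration.

def make_linear_predictor(coef, intercept):
    """Return a per-op latency predictor: latency = coef . features + intercept."""
    def predict(features):
        return sum(c * f for c, f in zip(coef, features)) + intercept
    return predict

# One predictor per operation type; in a real framework these would be
# fitted on profiled latency data collected per device and setting.
predictors = {
    "conv2d": make_linear_predictor([0.002, 0.01], 0.5),   # features: FLOPs (M), params (M)
    "dense":  make_linear_predictor([0.001, 0.005], 0.1),
}

def predict_end_to_end(ops):
    """ops: list of (op_type, feature_vector); returns summed latency (ms)."""
    return sum(predictors[op_type](features) for op_type, features in ops)

# A toy network described as a sequence of operations with their features.
net = [("conv2d", [300.0, 1.2]), ("conv2d", [150.0, 0.6]), ("dense", [10.0, 2.0])]
print(round(predict_end_to_end(net), 3))  # prints 2.038
```

A real predictor would also need per-device models and features capturing framework-level optimizations (e.g., operator fusion), since, as the abstract notes, these substantially affect measured latency.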