Paper Title
Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural Networks on Edge NPUs
Paper Authors
Paper Abstract
With deep neural networks (DNNs) emerging as the backbone in a multitude of computer vision tasks, their adoption in real-world applications broadens continuously. Given the abundance and omnipresence of smart devices in the consumer landscape, "smart ecosystems" are being formed where sensing happens concurrently rather than standalone. This is shifting the on-device inference paradigm towards deploying centralised neural processing units (NPUs) at the edge, where multiple devices (e.g. in smart homes or autonomous vehicles) can stream their data for processing with dynamic rates. While this provides enhanced potential for input batching, naive solutions can lead to subpar performance and quality of experience, especially under spiking loads. At the same time, the deployment of dynamic DNNs, comprising stochastic computation graphs (e.g. early-exit (EE) models), introduces a new dimension of dynamic behaviour in such systems. In this work, we propose a novel early-exit-aware scheduling algorithm that allows sample preemption at run time, to account for the dynamicity introduced both by the arrival and early-exiting processes. At the same time, we introduce two novel dimensions to the design space of the NPU hardware architecture, namely Fluid Batching and Stackable Processing Elements, that enable run-time adaptability to different batch sizes and significantly improve the NPU utilisation even at small batches. Our evaluation shows that the proposed system achieves an average 1.97x and 6.7x improvement over state-of-the-art DNN streaming systems in terms of average latency and tail latency service-level objective (SLO) satisfaction, respectively.
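To make the serving idea concrete, below is a minimal, illustrative Python sketch of an exit-aware preemptive batching loop, assuming a network with several early-exit points: whenever a sample's exit confidence crosses a threshold it leaves the batch, and the freed slot is immediately claimed by the most urgent waiting sample instead of waiting for the whole batch to drain. All names (`Sample`, `run_stage`, `EXIT_THRESHOLD`, `MAX_BATCH`, `NUM_STAGES`) and the confidence model are hypothetical placeholders, not the paper's implementation.

```python
# Illustrative sketch only -- not the paper's implementation. It mimics the
# core idea of exit-aware preemptive batching in software: samples that take
# an early exit free their batch slot mid-network, and the most urgent
# waiting samples claim the freed slots immediately.
import heapq
import random
import time
from dataclasses import dataclass, field

EXIT_THRESHOLD = 0.9  # assumed confidence threshold for taking an early exit
MAX_BATCH = 8         # assumed NPU batch capacity
NUM_STAGES = 3        # assumed number of exit points in the DNN

@dataclass(order=True)
class Sample:
    deadline: float                               # earliest-deadline-first priority
    sid: int = field(compare=False)
    stage: int = field(compare=False, default=0)  # next DNN segment to run

def run_stage(group, stage):
    """Placeholder for executing one DNN segment on the NPU. Returns a
    per-sample exit confidence; a real system would query the accelerator."""
    return {s.sid: random.uniform(0.4, 0.8) + 0.15 * stage for s in group}

def serve(arrival_heap):
    batch = []  # in-flight samples, possibly at different depths
    while arrival_heap or batch:
        # Preemptive refill: urgent waiting samples claim free slots now,
        # rather than after the whole batch has drained.
        while arrival_heap and len(batch) < MAX_BATCH:
            batch.append(heapq.heappop(arrival_heap))
        # Group in-flight samples by depth, then advance each by one segment.
        groups = {}
        for s in batch:
            groups.setdefault(s.stage, []).append(s)
        still_running = []
        for depth in sorted(groups):
            confidences = run_stage(groups[depth], depth)
            for s in groups[depth]:
                s.stage += 1
                if confidences[s.sid] >= EXIT_THRESHOLD or s.stage == NUM_STAGES:
                    print(f"sample {s.sid} exited after segment {s.stage}")
                else:
                    still_running.append(s)  # slot stays occupied
        batch = still_running

if __name__ == "__main__":
    now = time.time()
    heap = [Sample(deadline=now + 0.05 * i, sid=i) for i in range(12)]
    heapq.heapify(heap)
    serve(heap)
```

In the paper's full system this reshaping is supported in hardware via the Fluid Batching and Stackable Processing Elements dimensions of the NPU design space; the sketch above only imitates the scheduling policy in software.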