Paper Title
DEFER: Distributed Edge Inference for Deep Neural Networks
Paper Authors
Abstract
Modern machine learning tools such as deep neural networks (DNNs) are playing a revolutionary role in many fields such as natural language processing, computer vision, and the Internet of Things. Once trained, deep learning models can be deployed on edge computers to perform classification and prediction on real-time data for these applications. Particularly for large models, the limited computational and memory resources of a single edge device can become the throughput bottleneck of an inference pipeline. To increase throughput and decrease per-device compute load, we present DEFER (Distributed Edge inFERence), a framework for distributed edge inference, which partitions deep neural networks into layers that can be spread across multiple compute nodes. The architecture consists of a single "dispatcher" node that distributes DNN partitions and inference data to the respective compute nodes. The compute nodes are connected in a series pattern, where each node's computed result is relayed to the subsequent node; the final result is then returned to the dispatcher. We quantify the throughput, energy consumption, network payload, and overhead of our framework under realistic network conditions using the CORE network emulator. We find that for the ResNet50 model, the inference throughput of DEFER with 8 compute nodes is 53% higher and per-node energy consumption is 63% lower than single-device inference. We further reduce network communication demands and energy consumption using the ZFP serialization and LZ4 compression algorithms. We have implemented DEFER in Python using the TensorFlow and Keras ML libraries, and have released DEFER as an open-source framework to benefit the research community.
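The partition-and-relay pattern described above can be illustrated with a minimal sketch. This is not DEFER's implementation: the real framework splits a Keras model into sub-models and relays intermediate activations over the network, whereas here layers are stand-in Python functions, the `partition` and `run_pipeline` helpers are hypothetical, and the "relay" is an in-process loop.

```python
# Sketch of series-partitioned inference: a sequential model is split into
# contiguous chunks, one per compute node, and each node's output feeds the
# next node. Layer functions here are toy stand-ins for DNN layers.
from typing import Callable, List

Layer = Callable[[float], float]

def partition(layers: List[Layer], num_nodes: int) -> List[List[Layer]]:
    """Split a sequential model into contiguous chunks, one per compute node."""
    k, r = divmod(len(layers), num_nodes)
    parts, start = [], 0
    for i in range(num_nodes):
        size = k + (1 if i < r else 0)  # spread any remainder over early nodes
        parts.append(layers[start:start + size])
        start += size
    return parts

def run_pipeline(parts: List[List[Layer]], x: float) -> float:
    """Each node applies its partition and relays the result to the next;
    the last node's output goes back to the dispatcher."""
    for node_layers in parts:
        for layer in node_layers:
            x = layer(x)
    return x

layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
parts = partition(layers, 2)
result = run_pipeline(parts, 1.0)  # ((1 + 1) * 2 - 3) ** 2 = 1.0
```

Because the chunks are contiguous, the pipelined result is identical to running all layers on one device; the gain is that each node holds and executes only its own partition.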