Paper Title

TASO: Time and Space Optimization for Memory-Constrained DNN Inference

Paper Authors

Yuan Wen, Andrew Anderson, Valentin Radu, Michael F. P. O'Boyle, David Gregg

Paper Abstract

Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices. State-of-the-art classification is typically achieved by large networks, which are prohibitively expensive to run on mobile and embedded devices with tightly constrained memory and energy budgets. We propose an approach for ahead-of-time, domain-specific optimization of CNN models, based on integer linear programming (ILP), for selecting the primitive operations that implement the convolutional layers. We optimize the trade-off between execution time and memory consumption by: 1) attempting to minimize execution time across the whole network by selecting data layouts and primitive operations to implement each layer; and 2) allocating an appropriate workspace that reflects the upper bound of the memory footprint per layer. These two optimization strategies can be used to run any CNN on any platform with a C compiler. Our evaluation with a range of popular ImageNet neural architectures (GoogLeNet, AlexNet, VGG, ResNet, and SqueezeNet) on the ARM Cortex-A15 yields speedups of 8x compared to greedy primitive selection, and reduces the memory requirement by 2.2x while sacrificing only 15% of inference time compared to a solver that considers inference time alone. In addition, our optimization approach exposes a range of optimal points along the Pareto frontier of the memory/latency trade-off, which can be selected under arbitrary system constraints.
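
To make the selection problem concrete, the following is a minimal illustrative ILP sketch, not the paper's actual formulation: the layer names, the primitive candidates, the profiled cost numbers, and the use of the PuLP solver are all assumptions for demonstration. It picks one primitive per layer to minimize total inference time, sizes a single shared workspace as the maximum footprint of the chosen primitives (layers execute one at a time, so they can reuse the same scratch buffer), and sweeps the memory budget to trace a small time/memory Pareto frontier in the spirit the abstract describes.

```python
# Illustrative ILP sketch of per-layer primitive selection under a memory
# budget (hypothetical layers, primitives, and costs; not the paper's model).
# Requires: pip install pulp
import pulp

# Hypothetical per-layer candidates: primitive -> (time in ms, workspace in KB).
# A real system would obtain these by profiling each primitive on the target.
candidates = {
    "conv1": {"im2col": (12.0, 900),  "winograd": (8.0, 1500),  "direct": (20.0, 0)},
    "conv2": {"im2col": (30.0, 2100), "winograd": (18.0, 3800), "direct": (55.0, 0)},
    "conv3": {"im2col": (25.0, 1700), "winograd": (15.0, 3000), "direct": (40.0, 0)},
}

def solve(memory_budget_kb):
    """Minimize total execution time subject to a shared-workspace bound."""
    prob = pulp.LpProblem("primitive_selection", pulp.LpMinimize)

    # x[layer][prim] = 1 if that primitive implements that layer.
    x = {
        layer: pulp.LpVariable.dicts(layer, prims.keys(), cat="Binary")
        for layer, prims in candidates.items()
    }
    # One shared workspace, bounded by the memory budget.
    ws = pulp.LpVariable("workspace_kb", lowBound=0, upBound=memory_budget_kb)

    # Objective: total inference time across all layers.
    prob += pulp.lpSum(
        candidates[l][p][0] * x[l][p] for l in candidates for p in candidates[l]
    )

    for l, prims in candidates.items():
        # Exactly one primitive per layer.
        prob += pulp.lpSum(x[l][p] for p in prims) == 1
        # The workspace must cover whichever primitive is chosen, so it ends
        # up at the maximum footprint over the selected primitives.
        for p, (_, mem) in prims.items():
            prob += ws >= mem * x[l][p]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    choice = {l: next(p for p in candidates[l] if x[l][p].value() == 1)
              for l in candidates}
    return pulp.value(prob.objective), ws.value(), choice

# Sweeping the budget exposes the time/memory trade-off points.
for budget in (500, 2000, 4000):
    t, w, choice = solve(budget)
    print(f"budget={budget}KB  time={t:.1f}ms  workspace={w:.0f}KB  {choice}")
```

Under a tight budget the model falls back to the workspace-free direct primitives; as the budget grows it switches layers to faster but memory-hungry choices, which is the Pareto-style exploration sketched above.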
