Paper Title

Parallel Blockwise Knowledge Distillation for Deep Neural Network Compression

Authors

Cody Blakeney, Xiaomin Li, Yan Yan, Ziliang Zong

Abstract

Deep neural networks (DNNs) have been extremely successful in solving many challenging AI tasks in natural language processing, speech recognition, and computer vision nowadays. However, DNNs are typically computation intensive, memory demanding, and power hungry, which significantly limits their usage on platforms with constrained resources. Therefore, a variety of compression techniques (e.g. quantization, pruning, and knowledge distillation) have been proposed to reduce the size and power consumption of DNNs. Blockwise knowledge distillation is one of the compression techniques that can effectively reduce the size of a highly complex DNN. However, it is not widely adopted due to its long training time. In this paper, we propose a novel parallel blockwise distillation algorithm to accelerate the distillation process of sophisticated DNNs. Our algorithm leverages local information to conduct independent blockwise distillation, utilizes depthwise separable layers as the efficient replacement block architecture, and properly addresses limiting factors (e.g. dependency, synchronization, and load balancing) that affect parallelism. The experimental results running on an AMD server with four Geforce RTX 2080Ti GPUs show that our algorithm can achieve 3x speedup plus 19% energy savings on VGG distillation, and 3.5x speedup plus 29% energy savings on ResNet distillation, both with negligible accuracy loss. The speedup of ResNet distillation can be further improved to 3.87 when using four RTX6000 GPUs in a distributed cluster.
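Because each block is distilled independently using only local information, the per-block distillation jobs can be scheduled across GPUs like independent tasks, which is where the load-balancing concern mentioned above arises. The following is a minimal pure-Python sketch of this idea (hypothetical, not the paper's implementation; `assign_blocks` and the example costs are illustrative), using a greedy longest-job-first assignment:

```python
# Hypothetical illustration of scheduling independent blockwise
# distillation jobs onto GPUs (not the authors' code).
import heapq

def assign_blocks(block_costs, num_gpus):
    """Greedily assign per-block training costs to GPUs,
    longest job first; returns (load, gpu_id, jobs) tuples."""
    # Min-heap keyed on current GPU load; gpu_id breaks ties.
    heap = [(0.0, gpu, []) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    for cost in sorted(block_costs, reverse=True):
        load, gpu, jobs = heapq.heappop(heap)  # least-loaded GPU
        jobs.append(cost)
        heapq.heappush(heap, (load + cost, gpu, jobs))
    return heap

# Example: 8 blocks with uneven distillation costs on 4 GPUs.
costs = [10, 9, 7, 6, 4, 3, 2, 1]
schedule = assign_blocks(costs, 4)
makespan = max(load for load, _, _ in schedule)  # slowest GPU
speedup = sum(costs) / makespan  # upper bound on parallel speedup
```

With these illustrative costs the bound is close to the 3.5x–3.87x ResNet speedups reported in the abstract, showing why balanced block assignment matters once dependency and synchronization overheads are removed.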
