Paper Title
Adjoined Networks: A Training Paradigm with Applications to Network Compression
Paper Authors
Paper Abstract
Compressing deep neural networks while maintaining accuracy is important when we want to deploy large, powerful models in production and/or on edge devices. One common technique used to achieve this goal is knowledge distillation. Typically, the output of a static pre-defined teacher (a large base network) is used as soft labels to train and transfer information to a student (or smaller) network. In this paper, we introduce Adjoined Networks, or AN, a learning paradigm that trains both the original base network and the smaller compressed network together. In our training approach, the parameters of the smaller network are shared across both the base and the compressed networks. Using our training paradigm, we can simultaneously compress (the student network) and regularize (the teacher network) any architecture. In this paper, we focus on popular CNN-based architectures used for computer vision tasks. We conduct an extensive experimental evaluation of our training paradigm on various large-scale datasets. Using ResNet-50 as the base network, AN achieves 71.8% top-1 accuracy with only 1.8M parameters and 1.6 GFLOPs on the ImageNet dataset. We further propose Differentiable Adjoined Networks (DAN), a training paradigm that augments AN by using neural architecture search to jointly learn both the width and the weights for each layer of the smaller network. DAN achieves ResNet-50 level accuracy on ImageNet with $3.8\times$ fewer parameters and $2.2\times$ fewer FLOPs.
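The shared-parameter idea described in the abstract can be illustrated with a minimal PyTorch sketch. Here we assume the compressed branch reuses the leading slice of each convolution's weight tensor and is coupled to the base branch by a distillation-style loss; the names AdjoinedConv and adjoined_loss, and the width_ratio and distill_weight parameters, are illustrative assumptions rather than the paper's exact formulation.

```python
# A minimal sketch of adjoined training: the smaller (student) network's
# weights are a shared slice of the base (teacher) network's weights, and
# both branches are trained together. The channel-slicing scheme and the
# combined loss below are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AdjoinedConv(nn.Module):
    """One conv layer evaluated twice: at full width (base branch) and on a
    narrow slice of the same weights (compressed branch)."""

    def __init__(self, in_ch, out_ch, width_ratio=0.25):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.in_small = max(1, int(in_ch * width_ratio))
        self.out_small = max(1, int(out_ch * width_ratio))

    def forward(self, x_base, x_small):
        y_base = self.conv(x_base)
        # The compressed branch reuses a slice of the same weight tensor,
        # so it adds no parameters of its own.
        w_small = self.conv.weight[: self.out_small, : self.in_small]
        y_small = F.conv2d(x_small, w_small, padding=1)
        return F.relu(y_base), F.relu(y_small)


def adjoined_loss(logits_base, logits_small, targets, distill_weight=1.0):
    """Cross-entropy on both branches plus a KL term pulling the compressed
    branch toward the base branch (an assumed distillation-style coupling)."""
    ce = F.cross_entropy(logits_base, targets) + F.cross_entropy(logits_small, targets)
    kl = F.kl_div(
        F.log_softmax(logits_small, dim=1),
        F.softmax(logits_base.detach(), dim=1),
        reduction="batchmean",
    )
    return ce + distill_weight * kl
```

In this sketch the per-layer width of the compressed branch is a fixed fraction of the base width; DAN, as described in the abstract, instead learns each layer's width jointly with the weights via neural architecture search.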