Paper Title

Knowledge Distillation Meets Self-Supervision

Authors

Guodong Xu, Ziwei Liu, Xiaoxiao Li, Chen Change Loy

Abstract

Knowledge distillation, which involves extracting the "dark knowledge" from a teacher network to guide the learning of a student network, has emerged as an important technique for model compression and transfer learning. Unlike previous works that exploit architecture-specific cues such as activation and attention for distillation, here we wish to explore a more general and model-agnostic approach for extracting "richer dark knowledge" from the pre-trained teacher model. We show that the seemingly different self-supervision task can serve as a simple yet powerful solution. For example, when performing contrastive learning between transformed entities, the noisy predictions of the teacher network reflect its intrinsic composition of semantic and pose information. By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student. In this paper, we discuss practical ways to exploit those noisy self-supervision signals with selective transfer for distillation. We further show that self-supervision signals improve conventional distillation with substantial gains under few-shot and noisy-label scenarios. Given the richer knowledge mined from self-supervision, our knowledge distillation approach achieves state-of-the-art performance on standard benchmarks, i.e., CIFAR100 and ImageNet, under both similar-architecture and cross-architecture settings. The advantage is even more pronounced under the cross-architecture setting, where our method outperforms the state-of-the-art CRD by an average of 2.3% accuracy on CIFAR100 across six different teacher-student pairs.
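The abstract describes two ingredients: conventional distillation on softened class predictions, plus an auxiliary term that matches the teacher's and student's similarity structure over transformed (augmented) entities. The sketch below illustrates both ideas in NumPy; it is a minimal illustration, not the authors' exact SSKD implementation, and all function names, the temperature values, and the use of cosine-similarity matrices are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=4.0):
    # Conventional (Hinton-style) distillation: KL divergence between
    # temperature-softened teacher and student class distributions.
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    return (T * T) * np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))

def ss_transfer_loss(student_feats, teacher_feats, tau=0.5):
    # Illustrative self-supervision transfer term: build a contrastive-style
    # similarity distribution over a batch of transformed entities for both
    # networks, then make the student mimic the teacher's (noisy) similarities.
    def sim_dist(f):
        f = f / np.linalg.norm(f, axis=1, keepdims=True)  # L2-normalize rows
        return softmax(f @ f.T / tau)                     # row-wise similarity distribution
    p = sim_dist(teacher_feats)
    q = sim_dist(student_feats)
    return np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))
```

In a training loop these two terms would be weighted and added to the student's ordinary cross-entropy loss; the selective-transfer step discussed in the paper (filtering which noisy teacher similarities to imitate) is omitted here for brevity.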
