Paper Title
Learn From the Past: Experience Ensemble Knowledge Distillation
Paper Authors
Paper Abstract
Traditional knowledge distillation transfers the "dark knowledge" of a pre-trained teacher network to a student network, ignoring the knowledge produced during the teacher's training process, which we call the teacher's experience. However, in realistic educational scenarios, the learning experience is often more important than the learning outcome. In this work, we propose a novel knowledge distillation method, named experience ensemble knowledge distillation (EEKD), which integrates the teacher's experience into knowledge transfer. We uniformly save a moderate number of intermediate models from the teacher's training process, and then integrate the knowledge of these intermediate models with an ensemble technique. A self-attention module adaptively assigns weights to the different intermediate models during knowledge transfer. We explore three principles for constructing EEKD, concerning the quality, weights, and number of intermediate models, and find a surprising result: a strong ensemble teacher does not necessarily produce a strong student. Experimental results on CIFAR-100 and ImageNet show that EEKD outperforms mainstream knowledge distillation methods and achieves state-of-the-art performance. In particular, EEKD even surpasses standard ensemble distillation while reducing training cost.
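The abstract sketches a concrete pipeline: uniformly snapshot the teacher during its training, then distill the student from an attention-weighted ensemble of those snapshots. Below is a minimal PyTorch sketch of one such training objective, assuming frozen intermediate checkpoints whose per-batch logits are already available. The `CheckpointAttention` module, its query/key design, the temperature `T`, and the mixing weight `alpha` are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CheckpointAttention(nn.Module):
    """Assigns an ensemble weight to each intermediate teacher checkpoint.

    Hypothetical design: score each checkpoint's logits against the student's
    logits with a learned query/key pair, then softmax over the checkpoints.
    """

    def __init__(self, num_classes: int):
        super().__init__()
        self.query = nn.Linear(num_classes, num_classes)
        self.key = nn.Linear(num_classes, num_classes)

    def forward(self, student_logits, teacher_logits):
        # student_logits: (B, C); teacher_logits: (K, B, C) over K checkpoints
        q = self.query(student_logits)                # (B, C)
        k = self.key(teacher_logits)                  # (K, B, C)
        scores = (k * q.unsqueeze(0)).sum(dim=-1)     # (K, B) dot-product scores
        weights = scores.softmax(dim=0)               # normalize over checkpoints
        # Per-sample weighted ensemble of the intermediate teachers' logits
        return (weights.unsqueeze(-1) * teacher_logits).sum(dim=0)  # (B, C)


def eekd_loss(student_logits, intermediate_logits, targets, attn, T=4.0, alpha=0.9):
    """KD loss against the attention-weighted ensemble plus label cross-entropy."""
    teacher_logits = torch.stack(intermediate_logits, dim=0)  # (K, B, C)
    # Detach the student's logits in the attention query so the student cannot
    # steer its own soft targets; the attention module still receives gradients
    # through the KD term below (a design assumption, not from the paper).
    ensemble = attn(student_logits.detach(), teacher_logits)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(ensemble / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce


# Toy usage: B samples, C classes, K uniformly saved checkpoints.
B, C, K = 8, 100, 4
attn = CheckpointAttention(num_classes=C)
student_logits = torch.randn(B, C, requires_grad=True)
with torch.no_grad():  # intermediate teachers are frozen
    intermediate_logits = [torch.randn(B, C) for _ in range(K)]
targets = torch.randint(0, C, (B,))
loss = eekd_loss(student_logits, intermediate_logits, targets, attn)
loss.backward()
```

In a real run, `intermediate_logits` would come from K checkpoints saved at uniformly spaced epochs of the teacher's training, and the balance `alpha` and temperature `T` would be tuned per dataset.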