论文标题
可切换的在线知识蒸馏
Switchable Online Knowledge Distillation
论文作者
论文摘要
在线知识蒸馏(OKD)通过相互利用教师和学生之间的差异来改善所涉及的模型。在它们之间的差距上有几个关键的瓶颈 - 例如,为什么以及何时以及何时会损害表现,尤其是对学生的表现?如何量化教师和学生之间的差距? - 接受了有限的正式研究。在本文中,我们提出了可切换的在线知识蒸馏(Switokd),以回答这些问题。 Switokd的核心思想不是专注于测试阶段的准确性差距,而是通过两种模式之间的切换策略 - 专家模式(暂停教师保持学生学习模式)和学习模式(重新学习教师),以适应训练阶段的差距,即蒸馏差距。为了拥有适当的蒸馏差距,我们进一步设计了一个自适应开关阈值,该阈值提供了有关何时切换到学习模式或专家模式的正式标准,从而改善了学生的表现。同时,教师受益于我们的自适应切换阈值,并基本上与其他在线艺术保持同步。我们进一步将Switokd扩展到具有两个基础拓扑的多个网络。最后,广泛的实验和分析验证了Switokd在最新面前的分类的优点。我们的代码可在https://github.com/hfutqian/switokd上找到。
Online Knowledge Distillation (OKD) improves the involved models by reciprocally exploiting the difference between teacher and student. Several crucial bottlenecks over the gap between them -- e.g., Why and when does a large gap harm the performance, especially for student? How to quantify the gap between teacher and student? -- have received limited formal study. In this paper, we propose Switchable Online Knowledge Distillation (SwitOKD), to answer these questions. Instead of focusing on the accuracy gap at test phase by the existing arts, the core idea of SwitOKD is to adaptively calibrate the gap at training phase, namely distillation gap, via a switching strategy between two modes -- expert mode (pause the teacher while keep the student learning) and learning mode (restart the teacher). To possess an appropriate distillation gap, we further devise an adaptive switching threshold, which provides a formal criterion as to when to switch to learning mode or expert mode, and thus improves the student's performance. Meanwhile, the teacher benefits from our adaptive switching threshold and keeps basically on a par with other online arts. We further extend SwitOKD to multiple networks with two basis topologies. Finally, extensive experiments and analysis validate the merits of SwitOKD for classification over the state-of-the-arts. Our code is available at https://github.com/hfutqian/SwitOKD.