Paper Title
Locally Linear Region Knowledge Distillation
Paper Authors
Paper Abstract
Knowledge distillation (KD) is an effective technique for transferring knowledge from one neural network (the teacher) to another (the student), thereby improving the student's performance. To make the student better mimic the teacher's behavior, existing work focuses on designing different criteria to align their logits or representations. Departing from these efforts, we address knowledge distillation from a novel data perspective. We argue that transferring knowledge only at sparse training data points does not enable the student to capture the local shape of the teacher function well. To address this issue, we propose Locally Linear Region Knowledge Distillation ($\rm L^2$RKD), which transfers knowledge over locally linear regions from the teacher to the student. This is achieved by requiring the student to mimic the outputs of the teacher function within locally linear regions, so that the student better captures the local shape of the teacher function and thus achieves better performance. Despite its simplicity, extensive experiments demonstrate that $\rm L^2$RKD is superior to the original KD in many respects: it outperforms KD and other state-of-the-art approaches by a large margin, shows robustness and superiority under few-shot settings, and is more compatible with existing distillation approaches, further improving their performance significantly.
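The abstract does not spell out how the locally linear regions are constructed or which matching criterion is used. The sketch below is a minimal, hypothetical illustration of the general idea: in addition to the standard point-wise KD term, the student is also matched to the teacher at points sampled from line segments around each training input. The function name `l2rkd_loss` and the parameters `T`, `n_points`, and `alpha_max` are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def l2rkd_loss(student, teacher, x, T=4.0, n_points=4, alpha_max=0.5):
    """Sketch of region-level distillation: match the student to the teacher
    not only at the training inputs x, but also at points sampled along line
    segments between x and a shuffled copy of x, one simple way to probe a
    local region around each input. Assumes image-shaped inputs (B, C, H, W).
    """
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)

    # Standard point-wise KD term: KL divergence between softened distributions.
    kd = F.kl_div(
        F.log_softmax(s_logits / T, dim=1),
        F.softmax(t_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # Region term: sample interpolated inputs inside a local region and
    # require the student to track the teacher's outputs there as well.
    region = 0.0
    x_shuffled = x[torch.randperm(x.size(0))]
    for _ in range(n_points):
        lam = alpha_max * torch.rand(x.size(0), 1, 1, 1, device=x.device)
        x_interp = (1.0 - lam) * x + lam * x_shuffled
        with torch.no_grad():
            t_interp = teacher(x_interp)
        s_interp = student(x_interp)
        region = region + F.kl_div(
            F.log_softmax(s_interp / T, dim=1),
            F.softmax(t_interp / T, dim=1),
            reduction="batchmean",
        ) * (T * T)
    return kd + region / n_points
```

In a training loop, this loss would typically be combined with the usual cross-entropy on ground-truth labels; the exact weighting between the point-wise and region terms is another design choice not specified in the abstract.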