Paper Title
Generalized Knowledge Distillation via Relationship Matching
Paper Authors
Paper Abstract
The knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks. Knowledge distillation extracts knowledge from the teacher and integrates it into the target model (a.k.a. the "student"), which expands the student's knowledge and improves its learning efficacy. Instead of requiring the teacher to work on the same task as the student, we borrow knowledge from a teacher trained on a general label space -- in this "Generalized Knowledge Distillation (GKD)" setting, the classes of the teacher and the student may be the same, completely different, or partially overlapping. We claim that the ability to compare instances acts as an essential factor threading knowledge across tasks, and propose the RElationship FacIlitated Local cLassifiEr Distillation (REFILLED) approach, which decouples the GKD flow of the embedding and the top-layer classifier. In particular, instead of reconciling instance-label confidences between models, REFILLED asks the teacher to reweight the hard tuples pushed forward by the student and then matches the similarity comparisons between instances. An embedding-induced classifier based on the teacher model supervises the student's classification confidence and adaptively emphasizes the most related supervision from the teacher. REFILLED demonstrates strong discriminative ability when the teacher's classes vary from identical to fully non-overlapping w.r.t. the student's. It also achieves state-of-the-art performance on standard knowledge distillation, one-step incremental learning, and few-shot learning tasks.
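The abstract describes two distillation signals: matching instance-wise similarity relationships between the teacher and student embeddings, and supervising the student's class confidence with a classifier induced from the teacher's embedding. The sketch below is a minimal PyTorch illustration of plausible forms of these two losses, written under our own assumptions (cosine similarities, KL divergence with a temperature, nearest-class-mean prototypes); the function names, `temperature` values, and the exact reweighting scheme are illustrative and not the paper's official implementation.

```python
# Hypothetical sketch of the two REFILLED-style losses; details are assumptions,
# not the authors' released code.
import torch
import torch.nn.functional as F


def relationship_matching_loss(student_emb, teacher_emb, temperature=4.0):
    """Match instance-wise similarity distributions between student and teacher.

    student_emb, teacher_emb: (batch, dim) embeddings of the same mini-batch.
    The teacher's softened pairwise similarities act as soft targets for the
    student's, so comparison ability transfers without sharing label spaces.
    """
    def pairwise_logits(emb):
        emb = F.normalize(emb, dim=1)
        sim = emb @ emb.t()                                   # cosine similarities
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        return sim.masked_fill(mask, float('-inf'))           # drop self-similarity

    s_logp = F.log_softmax(pairwise_logits(student_emb) / temperature, dim=1)
    t_prob = F.softmax(pairwise_logits(teacher_emb) / temperature, dim=1)
    return F.kl_div(s_logp, t_prob, reduction='batchmean') * temperature ** 2


def embedding_induced_classifier_loss(student_logits, teacher_emb, labels,
                                      temperature=4.0):
    """Supervise the student's class confidence with a nearest-class-mean
    classifier built on the teacher's embedding over the student's classes
    (assumed form of the 'embedding-induced classifier')."""
    teacher_emb = F.normalize(teacher_emb, dim=1)
    classes = labels.unique()
    prototypes = torch.stack([teacher_emb[labels == c].mean(0) for c in classes])
    soft_targets = F.softmax(teacher_emb @ prototypes.t() / temperature, dim=1)
    logits = student_logits[:, classes]                        # same class ordering
    return F.kl_div(F.log_softmax(logits / temperature, dim=1),
                    soft_targets, reduction='batchmean') * temperature ** 2
```

In a training loop, the two terms would typically be combined with the ordinary cross-entropy on the student's own labels, with weights tuned on a validation set.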