Paper Title
Toward Student-Oriented Teacher Network Training For Knowledge Distillation
Paper Authors
Paper Abstract
How to conduct teacher training for knowledge distillation is still an open problem. It has been widely observed that a best-performing teacher does not necessarily yield the best-performing student, suggesting a fundamental discrepancy between the current teacher training practice and the ideal teacher training strategy. To fill this gap, we explore the feasibility of training a teacher that is oriented toward student performance with empirical risk minimization (ERM). Our analyses are inspired by recent findings that the effectiveness of knowledge distillation hinges on the teacher's capability to approximate the true label distribution of training inputs. We theoretically establish that the ERM minimizer can approximate the true label distribution of training data as long as the feature extractor of the learner network is Lipschitz continuous and is robust to feature transformations. In light of our theory, we propose a teacher training method, SoTeacher, which incorporates Lipschitz regularization and consistency regularization into ERM. Experiments on benchmark datasets using various knowledge distillation algorithms and teacher-student pairs confirm that SoTeacher can improve student accuracy consistently.
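Below is a minimal sketch, assuming PyTorch, of how a teacher training objective in this spirit might combine the ERM cross-entropy loss with a consistency term across augmented views and a simple Lipschitz-style penalty. The function names, the spectral-norm surrogate for the Lipschitz constraint, the KL-based consistency loss, and the weighting hyperparameters are illustrative assumptions, not the paper's exact SoTeacher implementation.

```python
import torch
import torch.nn.functional as F

def lipschitz_penalty(model):
    """Rough Lipschitz surrogate: sum of spectral norms of weight matrices.

    Illustrative only; the paper's actual Lipschitz regularizer may differ.
    """
    penalty = 0.0
    for p in model.parameters():
        if p.dim() >= 2:
            # Largest singular value of the weight, flattened to a 2-D matrix.
            penalty = penalty + torch.linalg.matrix_norm(p.flatten(1), ord=2)
    return penalty

def teacher_loss(model, x, x_aug, y, lam_cons=1.0, lam_lip=1e-4):
    """ERM + consistency regularization + Lipschitz penalty (hypothetical sketch)."""
    logits = model(x)          # predictions on the original view
    logits_aug = model(x_aug)  # predictions on an augmented view

    # Standard ERM term: cross-entropy against the hard labels.
    erm = F.cross_entropy(logits, y)

    # Consistency regularization: predictions should agree across augmentations.
    cons = F.kl_div(
        F.log_softmax(logits_aug, dim=1),
        F.softmax(logits, dim=1).detach(),
        reduction="batchmean",
    )

    return erm + lam_cons * cons + lam_lip * lipschitz_penalty(model)
```

In a training loop, `x_aug` would be a second randomly augmented copy of the batch `x`; the two regularization weights would be tuned per dataset.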