Paper Title
Respecting Transfer Gap in Knowledge Distillation
Paper Authors
Paper Abstract
Knowledge distillation (KD) is essentially a process of transferring a teacher model's behavior, e.g., network response, to a student model. The network response serves as additional supervision to formulate the machine domain, which uses the data collected from the human domain as a transfer set. Traditional KD methods hold an underlying assumption that the data collected in both the human domain and the machine domain are independent and identically distributed (IID). We point out that this naive assumption is unrealistic and that there is indeed a transfer gap between the two domains. Although the gap offers the student model external knowledge from the machine domain, the imbalanced teacher knowledge would make us incorrectly estimate how much to transfer from teacher to student per sample on the non-IID transfer set. To tackle this challenge, we propose Inverse Probability Weighting Distillation (IPWD), which estimates the propensity score of a training sample belonging to the machine domain and assigns its inverse as a weight to compensate for under-represented samples. Experiments on CIFAR-100 and ImageNet demonstrate the effectiveness of IPWD for both two-stage distillation and one-stage self-distillation.
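As a rough illustration of the weighting idea described above (a minimal sketch, not the paper's exact formulation), the PyTorch snippet below applies a per-sample inverse-propensity weight to a standard temperature-scaled KD loss. The `propensity` tensor, the way it is estimated (e.g., by a small domain classifier or from per-sample loss statistics), and the temperature value are assumptions for illustration; the paper's own estimator may differ.

```python
import torch
import torch.nn.functional as F

def ipw_distillation_loss(student_logits, teacher_logits, propensity, T=4.0):
    """Inverse-probability-weighted KD loss (illustrative sketch).

    Args:
        student_logits: (N, C) raw outputs of the student model.
        teacher_logits: (N, C) raw outputs of the teacher model.
        propensity:     (N,) estimated probability that each sample is
                        well represented by the machine (teacher) domain.
                        How this is estimated is an assumption here.
        T: softmax temperature for distillation.
    """
    # Standard KD term: KL divergence between softened distributions,
    # kept per-sample (reduction='none') so it can be reweighted.
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    kd_per_sample = F.kl_div(log_p_student, p_teacher,
                             reduction='none').sum(dim=1) * (T * T)

    # Inverse probability weighting: samples with a low propensity score
    # (under-represented in the machine domain) receive a larger weight.
    weights = 1.0 / propensity.clamp(min=1e-6)
    weights = weights / weights.mean()  # normalize to keep the loss scale stable

    return (weights * kd_per_sample).mean()
```

In a full training loop, this weighted distillation term would typically be combined with the usual cross-entropy loss on ground-truth labels, as in standard KD setups.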