Paper Title
Subclass Distillation
Paper Authors
Paper Abstract
After a large "teacher" neural network has been trained on labeled data, the probabilities that the teacher assigns to incorrect classes reveal a lot of information about the way in which the teacher generalizes. By training a small "student" model to match these probabilities, it is possible to transfer most of the generalization ability of the teacher to the student, often producing a much better small model than directly training the student on the training data. The transfer works best when there are many possible classes because more is then revealed about the function learned by the teacher, but in cases where there are only a few possible classes we show that we can improve the transfer by forcing the teacher to divide each class into many subclasses that it invents during the supervised training. The student is then trained to match the subclass probabilities. For datasets where there are known, natural subclasses we demonstrate that the teacher learns similar subclasses and these improve distillation. For clickthrough datasets where the subclasses are unknown we demonstrate that subclass distillation allows the student to learn faster and better.
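As a rough illustration of the mechanism the abstract describes, the sketch below shows what a subclass-distillation training loss could look like. This is a hedged sketch, not the paper's exact formulation: it assumes PyTorch, a subclass-logit layout of shape (batch, num_classes * num_subclasses) grouped by class, and a standard temperature-scaled distillation objective; the function name, the temperature, and the mixing weight alpha are illustrative choices.

```python
import torch
import torch.nn.functional as F

def subclass_distillation_loss(student_logits, teacher_logits, labels,
                               num_classes, num_subclasses,
                               temperature=2.0, alpha=0.5):
    """Hypothetical sketch of a subclass-distillation objective.

    Both logit tensors have shape (batch, num_classes * num_subclasses),
    with the subclass logits for each class stored contiguously.
    """
    # Soft loss: the student matches the teacher's temperature-softened
    # distribution over all subclass logits (this is where the teacher's
    # invented subclass structure is transferred).
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    distill = F.kl_div(log_student, soft_targets,
                       reduction="batchmean") * temperature ** 2

    # Hard loss: subclass probabilities are summed within each class to
    # recover class probabilities, which are scored against the true label.
    student_probs = F.softmax(student_logits, dim=-1)
    class_probs = student_probs.view(-1, num_classes, num_subclasses).sum(dim=-1)
    hard = F.nll_loss(torch.log(class_probs + 1e-12), labels)

    return alpha * distill + (1.0 - alpha) * hard


# Illustrative usage: a binary task where the teacher invents 5 subclasses
# per class, giving 10 subclass logits in total.
student_logits = torch.randn(32, 10)
teacher_logits = torch.randn(32, 10)
labels = torch.randint(0, 2, (32,))
loss = subclass_distillation_loss(student_logits, teacher_logits, labels,
                                  num_classes=2, num_subclasses=5)
```

The key design point mirrored from the abstract is that the student is supervised on the full subclass distribution (the soft term), while correctness on the original few-class task is enforced only after aggregating subclass probabilities back into class probabilities (the hard term). The paper's additional auxiliary loss that encourages the teacher's subclasses to differ is not shown here.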