Paper Title
DistPro: Searching A Fast Knowledge Distillation Process via Meta Optimization
Paper Authors
Paper Abstract
Recent knowledge distillation (KD) studies show that different manually designed schemes significantly impact the learned results. Yet automatically searching for an optimal distillation scheme in KD has not been well explored. In this paper, we propose DistPro, a novel framework that searches for an optimal KD process via differentiable meta-learning. Specifically, given a pair of student and teacher networks, DistPro first sets up a rich set of KD connections from the transmitting layers of the teacher to the receiving layers of the student; meanwhile, various transforms are proposed for comparing feature maps along each pathway for distillation. Then, each combination of a connection and a transform choice (a pathway) is associated with a stochastic weighting process that indicates its importance at every step of the distillation. In the searching stage, these processes can be effectively learned through our proposed bi-level meta-optimization strategy. In the distillation stage, DistPro adopts the learned processes for knowledge distillation, which significantly improves student accuracy, especially when faster training is required. Lastly, we find the learned processes generalize across similar tasks and networks. In our experiments, DistPro produces state-of-the-art (SoTA) accuracy under varying numbers of training epochs on popular datasets, i.e., CIFAR-100 and ImageNet, which demonstrates the effectiveness of our framework.
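To make the pathway-weighting idea concrete, the sketch below (a minimal assumption-laden illustration, not the authors' implementation) shows how each (connection, transform) pathway could carry a learnable importance weight whose weighted feature-matching losses are summed into one distillation objective. The layer shapes, 1x1-conv transforms, and the comment about the bi-level update are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PathwayWeightedKDLoss(nn.Module):
    """Weighted sum of feature-distillation losses over (connection, transform) pathways."""

    def __init__(self, num_pathways: int):
        super().__init__()
        # One logit per pathway; softmax yields the importance weights.
        # Updating these logits over training steps realizes the
        # "weighting process" described in the abstract.
        self.logits = nn.Parameter(torch.zeros(num_pathways))

    def forward(self, student_feats, teacher_feats, transforms):
        weights = F.softmax(self.logits, dim=0)
        loss = torch.zeros((), device=weights.device)
        for w, f_s, f_t, t in zip(weights, student_feats, teacher_feats, transforms):
            # The transform maps the student feature into the teacher's space
            # before comparison; teacher features are treated as constants.
            loss = loss + w * F.mse_loss(t(f_s), f_t.detach())
        return loss


if __name__ == "__main__":
    # Toy feature maps standing in for two teacher->student connections.
    student_feats = [torch.randn(2, 32, 8, 8), torch.randn(2, 64, 4, 4)]
    teacher_feats = [torch.randn(2, 64, 8, 8), torch.randn(2, 128, 4, 4)]
    transforms = nn.ModuleList([nn.Conv2d(32, 64, 1), nn.Conv2d(64, 128, 1)])

    kd_loss_fn = PathwayWeightedKDLoss(num_pathways=2)
    kd_loss = kd_loss_fn(student_feats, teacher_feats, transforms)

    # In the search stage, the pathway logits would be updated on a held-out
    # (meta) loss after a student update -- the bi-level part -- while in the
    # distillation stage the learned weights are fixed and reused.
    kd_loss.backward()
    print(kd_loss.item())
```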