Paper title
Sparse Teachers Can Be Dense with Knowledge
Paper authors
Paper abstract
Recent advances in distilling pretrained language models have found that, besides the expressiveness of knowledge, student-friendliness should also be taken into consideration to realize a truly knowledgable teacher. Based on a pilot study, we find that over-parameterized teachers can produce expressive yet student-unfriendly knowledge and are thus limited in overall knowledgableness. To remove the parameters that result in student-unfriendliness, we propose a sparse teacher trick guided by an overall knowledgable score for each teacher parameter. The knowledgable score is essentially an interpolation of the expressiveness and student-friendliness scores. The aim is to ensure that the expressive parameters are retained while the student-unfriendly ones are removed. Extensive experiments on the GLUE benchmark show that the proposed sparse teachers can be dense with knowledge and lead to students with compelling performance in comparison with a series of competitive baselines.
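The scoring idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the expressiveness and student-friendliness scores are computed, how the interpolation is weighted, or how the removal threshold is chosen, so everything concrete below is an assumption made for illustration: the function names, the mixing weight `alpha`, the `sparsity` fraction, and the toy score values are all hypothetical and are not taken from the paper.

```python
def knowledgable_scores(expressiveness, friendliness, alpha=0.5):
    """Linearly interpolate two per-parameter scores.

    `alpha` is a hypothetical mixing weight: 0 uses only expressiveness,
    1 uses only student-friendliness.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(expressiveness) != len(friendliness):
        raise ValueError("score lists must have the same length")
    return [(1.0 - alpha) * e + alpha * f
            for e, f in zip(expressiveness, friendliness)]


def sparsity_mask(scores, sparsity):
    """Return a 0/1 mask that removes the lowest-scoring parameters.

    `sparsity` is the (assumed) fraction of parameters to remove.
    """
    n_remove = int(len(scores) * sparsity)
    # Indices ordered from lowest to highest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    removed = set(order[:n_remove])
    return [0 if i in removed else 1 for i in range(len(scores))]


# Toy per-parameter scores (made up for illustration only).
expressiveness = [0.9, 0.8, 0.2, 0.6]
friendliness = [0.1, 0.7, 0.3, 0.9]

scores = knowledgable_scores(expressiveness, friendliness, alpha=0.5)
mask = sparsity_mask(scores, sparsity=0.5)
print(mask)
```

In this toy setting the first parameter is highly expressive but student-unfriendly, so its interpolated score is low enough for it to be removed, while the second and fourth parameters, which do well on both criteria, are kept.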