Paper Title

Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor

Authors

Xinyu Wang, Yong Jiang, Zhaohui Yan, Zixia Jia, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, Kewei Tu

Abstract

Knowledge distillation is a critical technique to transfer knowledge between models, typically from a large model (the teacher) to a more fine-grained one (the student). The objective function of knowledge distillation is typically the cross-entropy between the teacher and the student's output distributions. However, for structured prediction problems, the output space is exponential in size; therefore, the cross-entropy objective becomes intractable to compute and optimize directly. In this paper, we derive a factorized form of the knowledge distillation objective for structured prediction, which is tractable for many typical choices of the teacher and student models. In particular, we show the tractability and empirical effectiveness of structural knowledge distillation between sequence labeling and dependency parsing models under four different scenarios: 1) the teacher and student share the same factorization form of the output structure scoring function; 2) the student factorization produces more fine-grained substructures than the teacher factorization; 3) the teacher factorization produces more fine-grained substructures than the student factorization; 4) the factorization forms from the teacher and the student are incompatible.
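
As a quick sketch of why the factorized objective is tractable (the notation below is ours, not necessarily the paper's): the distillation objective is the cross-entropy

\mathcal{L}_{\mathrm{KD}} = -\sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} P_t(\mathbf{y} \mid \mathbf{x}) \, \log P_s(\mathbf{y} \mid \mathbf{x}),

which sums over the exponentially large output space \mathcal{Y}(\mathbf{x}). Assuming the student is log-linear over substructures u, i.e. P_s(\mathbf{y} \mid \mathbf{x}) = \exp\big(\sum_{u \in \mathbf{y}} s(u, \mathbf{x})\big) / Z_s(\mathbf{x}), pushing the sum over \mathbf{y} inside gives the factorized form

\mathcal{L}_{\mathrm{KD}} = -\sum_{u} P_t(u \mid \mathbf{x}) \, s(u, \mathbf{x}) + \log Z_s(\mathbf{x}),

where P_t(u \mid \mathbf{x}) is the teacher's marginal probability of substructure u. For typical sequence labeling and dependency parsing models, both the teacher marginals and \log Z_s(\mathbf{x}) can be computed with dynamic programming (e.g., forward-backward or inside-style algorithms), which is what makes the objective computable despite the exponential output space.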
