Paper Title
Improving Task-Agnostic BERT Distillation with Layer Mapping Search
Paper Authors
Paper Abstract
Knowledge distillation (KD), which transfers knowledge from a large teacher model to a small student model, has recently been widely used to compress BERT models. Besides the output-level supervision in the original KD, recent works show that layer-level supervision is crucial to the performance of the student BERT model. However, previous works design the layer mapping strategy heuristically (e.g., uniform or last-layer), which can lead to inferior performance. In this paper, we propose to use a genetic algorithm (GA) to automatically search for the optimal layer mapping. To accelerate the search process, we further propose a proxy setting in which a small portion of the training corpus is sampled for distillation and three representative tasks are chosen for evaluation. After obtaining the optimal layer mapping, we perform task-agnostic BERT distillation with it on the whole corpus to build a compact student model, which can be directly fine-tuned on downstream tasks. Comprehensive experiments on the evaluation benchmarks demonstrate that 1) the layer mapping strategy has a significant effect on task-agnostic BERT distillation, and different layer mappings can result in quite different performances; 2) the optimal layer mapping strategy obtained from the proposed search process consistently outperforms the heuristic ones; 3) with the optimal layer mapping, our student model achieves state-of-the-art performance on the GLUE tasks.
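To make the described search procedure more concrete, below is a minimal sketch (not the authors' implementation) of a genetic algorithm over layer mappings. It assumes a 6-layer student and a 12-layer teacher, and a hypothetical `proxy_score` function that stands in for the paper's proxy evaluation (distillation on a small corpus subset followed by fine-tuning on a few representative tasks); here the stub simply returns a random score so the sketch runs end to end.

```python
import random

# Hypothetical setup: a 6-layer student distilled from a 12-layer teacher.
NUM_STUDENT_LAYERS = 6
NUM_TEACHER_LAYERS = 12

def proxy_score(mapping):
    """Stand-in for the proxy evaluation: in practice, distill the student
    on a small sampled corpus with this layer mapping, fine-tune on a few
    representative tasks, and return the average dev score. Here it returns
    a random number so the sketch is runnable."""
    return random.random()

def random_mapping():
    # A candidate is one teacher layer index per student layer.
    return tuple(random.randint(1, NUM_TEACHER_LAYERS) for _ in range(NUM_STUDENT_LAYERS))

def crossover(parent_a, parent_b):
    # Single-point crossover of two layer mappings.
    point = random.randint(1, NUM_STUDENT_LAYERS - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(mapping, rate=0.1):
    # With probability `rate`, reassign a student layer to a random teacher layer.
    return tuple(
        random.randint(1, NUM_TEACHER_LAYERS) if random.random() < rate else t
        for t in mapping
    )

def genetic_search(pop_size=20, generations=10, elite=4):
    population = [random_mapping() for _ in range(pop_size)]
    for _ in range(generations):
        # Rank candidates by proxy score and keep the best as parents.
        ranked = sorted(population, key=proxy_score, reverse=True)
        parents = ranked[:elite]
        # Fill the rest of the population with mutated offspring of random parent pairs.
        children = [
            mutate(crossover(*random.sample(parents, 2)))
            for _ in range(pop_size - elite)
        ]
        population = parents + children
    return max(population, key=proxy_score)

if __name__ == "__main__":
    best_mapping = genetic_search()
    print("best layer mapping (student layer -> teacher layer):", best_mapping)
```

The returned mapping would then be used for task-agnostic distillation on the full corpus; population size, number of generations, and the mutation rate are illustrative values, not those reported in the paper.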