Large-scale Knowledge Distillation with Elastic Heterogeneous Computing Resources

Paper title

Large-scale Knowledge Distillation with Elastic Heterogeneous Computing Resources

Authors

Ji Liu, Daxiang Dong, Xi Wang, An Qin, Xingjian Li, Patrick Valduriez, Dejing Dou, Dianhai Yu

Abstract

Although more layers and more parameters generally improve model accuracy, such large models have high computational complexity and require a large amount of memory, which exceeds the inference capacity of small devices and incurs long training time. Moreover, the long training and inference time of large models is difficult to afford even on high-performance servers. Knowledge distillation, which compresses a large deep model (the teacher model) into a compact model (the student model), has emerged as a promising way to deal with large models. However, existing knowledge distillation methods cannot exploit elastically available computing resources, which leads to low efficiency. In this paper, we propose an Elastic Deep Learning framework for knowledge Distillation, i.e., EDL-Dist. The advantages of EDL-Dist are three-fold. First, the inference and training processes are separated. Second, elastically available computing resources can be utilized to improve efficiency. Third, fault tolerance is supported for both the training and inference processes. We conduct extensive experiments to show that EDL-Dist is up to 3.125 times faster than the baseline method (online knowledge distillation) in terms of throughput, while its accuracy is similar or higher.
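The separation of teacher inference from student training mentioned in the abstract can be illustrated with a small sketch. This is not the EDL-Dist implementation; it is a toy, standard-library-only example that assumes a conventional temperature-softened KL-divergence distillation loss, which the abstract does not specify. All names (`softmax`, `distillation_loss`, `teacher_worker`, `run_decoupled`) are hypothetical, and the teacher's logits are passed in as plain lists to stand in for a real model's output.

```python
import math
import queue
import threading


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """KL(teacher || student) on softened distributions (an assumed loss)."""
    student_probs = softmax(student_logits, temperature)
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def teacher_worker(batches, out_queue, temperature):
    """Hypothetical teacher side: only runs inference and emits soft labels."""
    for batch_id, teacher_logits in batches:
        out_queue.put((batch_id, softmax(teacher_logits, temperature)))
    out_queue.put(None)  # sentinel: no more soft labels


def run_decoupled(batches, student_logits_by_id, temperature=2.0):
    """Hypothetical student side: consumes soft labels as they arrive."""
    soft_labels = queue.Queue()
    worker = threading.Thread(
        target=teacher_worker, args=(batches, soft_labels, temperature)
    )
    worker.start()
    losses = {}
    while True:
        item = soft_labels.get()
        if item is None:
            break
        batch_id, teacher_probs = item
        losses[batch_id] = distillation_loss(
            student_logits_by_id[batch_id], teacher_probs, temperature
        )
    worker.join()
    return losses
```

Because the teacher and the student communicate only through the queue, teacher workers could in principle be added or removed without touching the student loop, which is the general idea behind using elastic resources; how EDL-Dist actually schedules workers and handles failures is not described in the abstract.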
