A Self-Paced Mixed Distillation Method for Non-Autoregressive Generation

Paper title

A Self-Paced Mixed Distillation Method for Non-Autoregressive Generation

Authors

Qi, Weizhen; Gong, Yeyun; Shen, Yelong; Jiao, Jian; Yan, Yu; Li, Houqiang; Zhang, Ruofei; Chen, Weizhu; Duan, Nan

Abstract

Non-autoregressive generation is a sequence generation paradigm that removes the dependency between target tokens. It can efficiently reduce text generation latency by using parallel decoding in place of token-by-token sequential decoding. However, due to the known multi-modality problem, non-autoregressive (NAR) models significantly underperform autoregressive (AR) models on various language generation tasks. Among the NAR models, BANG is the first large-scale pre-trained model on an unlabeled English raw text corpus. It treats different generation paradigms as its pre-training tasks, including autoregressive (AR), non-autoregressive (NAR), and semi-non-autoregressive (semi-NAR) information flow with a multi-stream strategy. It achieves state-of-the-art performance without any distillation techniques. However, AR distillation has been shown to be a very effective solution for improving NAR performance. In this paper, we propose a novel self-paced mixed distillation method to further improve the generation quality of BANG. First, we propose a mixed distillation strategy based on the AR stream knowledge. Second, we encourage the model to focus on samples with the same modality through self-paced learning. The proposed self-paced mixed distillation algorithm improves the generation quality and has no influence on the inference latency. We carry out extensive experiments on summarization and question generation tasks to validate its effectiveness. To further illustrate the commercial value of our approach, we conduct experiments on three generation tasks in real-world advertising applications. Experimental results on commercial data show the effectiveness of the proposed model. Compared with BANG, it achieves a significant BLEU score improvement. Compared with the autoregressive generation method, it achieves a speedup of more than 7x.
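
The abstract does not give the exact form of the self-paced objective, so the following is only a minimal sketch of generic hard self-paced weighting, not the paper's algorithm. It assumes per-sample losses are already computed; the names `self_paced_weights`, `self_paced_loss`, and `threshold` are hypothetical and chosen for illustration.

```python
from typing import List


def self_paced_weights(losses: List[float], threshold: float) -> List[float]:
    """Hard self-paced weights: 1.0 if a sample's loss is below the threshold, else 0.0.

    Assumption: this is the classic hard-threshold scheme from generic
    self-paced learning, not necessarily the weighting used in the paper.
    """
    return [1.0 if loss < threshold else 0.0 for loss in losses]


def self_paced_loss(losses: List[float], threshold: float) -> float:
    """Average the losses of the selected (easier) samples; 0.0 if none are selected."""
    weights = self_paced_weights(losses, threshold)
    selected = sum(weights)
    if selected == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / selected


if __name__ == "__main__":
    batch_losses = [0.5, 2.0, 1.5]  # hypothetical per-sample training losses
    print(self_paced_weights(batch_losses, threshold=1.8))
    print(self_paced_loss(batch_losses, threshold=1.8))
```

In this kind of scheme the threshold is usually raised over the course of training, so that harder samples are admitted gradually. Because the weighting only affects the training loss, it is consistent with the abstract's claim that inference latency is unchanged.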
