Paper Title


Hybrid-Regressive Neural Machine Translation

Paper Authors

Qiang Wang, Xinhui Hu, Ming Chen

Paper Abstract


In this work, we empirically confirm that non-autoregressive translation with an iterative refinement mechanism (IR-NAT) suffers from poor acceleration robustness because it is more sensitive to decoding batch size and computing device settings than autoregressive translation (AT). Inspired by this, we investigate how to better combine the strengths of the autoregressive and non-autoregressive translation paradigms. To this end, we demonstrate through synthetic experiments that prompting with a small number of AT predictions can enable one-shot non-autoregressive translation to match the performance of IR-NAT. Following this line, we propose a new two-stage translation prototype called hybrid-regressive translation (HRT). Specifically, HRT first generates a discontinuous sequence via autoregression (e.g., making a prediction every k tokens, k > 1) and then fills in all previously skipped tokens at once in a non-autoregressive manner. We also propose a bag of techniques to train HRT effectively and efficiently without adding any model parameters. HRT achieves a state-of-the-art BLEU score of 28.49 on the WMT En-De task and is at least 1.5x faster than AT, regardless of batch size and device. In addition, HRT successfully inherits the good characteristics of AT in the deep-encoder-shallow-decoder architecture. Concretely, compared to the vanilla HRT with a 6-layer encoder and 6-layer decoder, the inference speed of HRT with a 12-layer encoder and 1-layer decoder is further doubled on both GPU and CPU without BLEU loss.
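To make the two-stage procedure concrete, here is a minimal Python sketch of hybrid-regressive decoding as the abstract describes it: an autoregressive pass that predicts every k-th token, followed by a single non-autoregressive pass that fills the skipped positions. The `at_step` and `nat_fill` callables and the `<mask>` placeholder are illustrative assumptions, not the authors' implementation.

```python
# Sketch of HRT's two-stage decoding. `at_step` and `nat_fill` stand in
# for the model's autoregressive and non-autoregressive decoders; both
# are hypothetical interfaces used only for illustration.

MASK = "<mask>"  # assumed placeholder token for skipped positions


def hrt_decode(src_tokens, at_step, nat_fill, k=2):
    """Two-stage hybrid-regressive decoding.

    Stage 1 (autoregressive): predict a discontinuous sequence, i.e.
    only every k-th target token, one token per sequential step.
    Stage 2 (non-autoregressive): fill all skipped tokens at once.

    at_step(src, prefix) -> next skipped token, or None at end of sequence
    nat_fill(src, template) -> template with every MASK replaced
    """
    # Stage 1: autoregression over a sequence roughly k times shorter
    # than the full target, so only ~n/k sequential steps are needed.
    skipped = []
    while True:
        tok = at_step(src_tokens, skipped)
        if tok is None:
            break
        skipped.append(tok)

    # Build the stage-2 template: k-1 masked slots before each AT token.
    template = []
    for tok in skipped:
        template.extend([MASK] * (k - 1))
        template.append(tok)

    # Stage 2: one parallel pass fills every masked position.
    return nat_fill(src_tokens, template)
```

The speedup intuition follows directly from this structure: stage 1 performs only about n/k sequential decoding steps instead of n, and stage 2 costs a single parallel forward pass, while the AT tokens interleaved in the template act as prompts that anchor the non-autoregressive fill.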
