Paper Title
Learning Source Phrase Representations for Neural Machine Translation
Paper Authors
Paper Abstract
The Transformer translation model (Vaswani et al., 2017), based on a multi-head attention mechanism, can be computed efficiently in parallel and has significantly pushed forward the performance of Neural Machine Translation (NMT). Though intuitively the attentional network can connect distant words via shorter network paths than RNNs, empirical analysis demonstrates that it still has difficulty in fully capturing long-distance dependencies (Tang et al., 2018). Considering that modeling phrases instead of words significantly improved Statistical Machine Translation (SMT) through the use of larger translation blocks ("phrases") and their reordering, modeling NMT at the phrase level is an intuitive proposal to help the model capture long-distance relationships. In this paper, we first propose an attentive phrase representation generation mechanism which is able to generate phrase representations from the corresponding token representations. In addition, we incorporate the generated phrase representations into the Transformer translation model to enhance its ability to capture long-distance relationships. In our experiments, we obtain significant improvements over the strong Transformer baseline on the WMT 14 English-German and English-French tasks, which shows the effectiveness of our approach. Our approach helps Transformer Base models perform at the level of Transformer Big models, and even significantly better for long sentences, with substantially fewer parameters and training steps. The fact that phrase representations help even in the Big setting further supports our conjecture that they make a valuable contribution to capturing long-distance relations.
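The abstract only states that phrase representations are generated from the corresponding token representations via attention; the exact architecture is described in the body of the paper. The following is a minimal, hypothetical sketch of what such an attentive pooling step could look like: token vectors are grouped into fixed-length segments and a learned query attends over the tokens of each segment to produce one phrase vector. The class name, the fixed segment length, and the single-query scoring are illustrative assumptions, not the authors' exact mechanism.

```python
import torch
import torch.nn as nn


class AttentivePhrasePooling(nn.Module):
    """Hypothetical sketch: pool the token vectors of each phrase segment
    into a single phrase vector using learned attention weights."""

    def __init__(self, d_model: int):
        super().__init__()
        # Learned query used to score every token inside a segment.
        self.query = nn.Parameter(torch.randn(d_model) / d_model ** 0.5)
        self.key = nn.Linear(d_model, d_model)

    def forward(self, tokens: torch.Tensor, phrase_len: int) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model); for simplicity this sketch
        # assumes seq_len is divisible by phrase_len.
        bsz, seq_len, d = tokens.shape
        segs = tokens.view(bsz, seq_len // phrase_len, phrase_len, d)
        # Score each token within its segment against the shared query.
        scores = torch.einsum("bspd,d->bsp", self.key(segs), self.query)
        weights = torch.softmax(scores, dim=-1)
        # Weighted sum of token vectors -> one vector per phrase segment.
        return torch.einsum("bsp,bspd->bsd", weights, segs)


# Usage: 2 sentences of 8 tokens each, d_model=16, phrases of 4 tokens.
pool = AttentivePhrasePooling(d_model=16)
phrase_reprs = pool(torch.randn(2, 8, 16), phrase_len=4)
print(phrase_reprs.shape)  # torch.Size([2, 2, 16])
```

In the paper's setting, the resulting phrase-level representations would additionally be fed into the Transformer (e.g., as an extra source for its attention layers) so that distant content can be reached through fewer, coarser-grained steps; how that integration is done is specified in the full paper, not in this sketch.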