Paper Title
VALHALLA: Visual Hallucination for Machine Translation
Paper Authors
Paper Abstract
Designing better machine translation systems by considering auxiliary inputs such as images has attracted much attention in recent years. While existing methods show promising performance over conventional text-only translation systems, they typically require paired text and image as input during inference, which limits their applicability to real-world scenarios. In this paper, we introduce a visual hallucination framework, called VALHALLA, which requires only source sentences at inference time and instead uses hallucinated visual representations for multimodal machine translation. In particular, given a source sentence, an autoregressive hallucination transformer is used to predict a discrete visual representation from the input text, and the combined text and hallucinated representations are utilized to obtain the target translation. We train the hallucination transformer jointly with the translation transformer using standard backpropagation with cross-entropy losses, while being guided by an additional loss that encourages consistency between predictions made with either ground-truth or hallucinated visual representations. Extensive experiments on three standard translation datasets with a diverse set of language pairs demonstrate the effectiveness of our approach over both text-only baselines and state-of-the-art methods. Project page: http://www.svcl.ucsd.edu/projects/valhalla.
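To make the training objective described above concrete, the following is a minimal PyTorch-style sketch of the three loss terms the abstract mentions: a cross-entropy loss for hallucinating discrete visual tokens, cross-entropy translation losses under both ground-truth and hallucinated visual inputs, and a consistency term between the two translation predictions. The module names, the hard argmax decoding, the use of KL divergence for consistency, and the unweighted loss sum are all illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn


class VALHALLASketch(nn.Module):
    """Sketch of the joint training objective, under the assumptions above."""

    def __init__(self, hallucinator: nn.Module, translator: nn.Module):
        super().__init__()
        self.hallucinator = hallucinator  # text -> logits over discrete visual tokens
        self.translator = translator      # (text, visual tokens) -> target-vocab logits

    def training_losses(self, src_tokens, tgt_tokens, gt_visual_tokens):
        # 1. Hallucination loss: predict discrete visual tokens from the source text.
        visual_logits = self.hallucinator(src_tokens)               # (B, Lv, V_vis)
        halluc_loss = F.cross_entropy(
            visual_logits.transpose(1, 2), gt_visual_tokens)

        # Hard token selection is a simplification; the actual method may use a
        # differentiable relaxation here (an assumption on our part).
        halluc_tokens = visual_logits.argmax(dim=-1)                # (B, Lv)

        # 2. Translation losses with ground-truth vs. hallucinated visual inputs.
        logits_gt = self.translator(src_tokens, gt_visual_tokens)   # (B, Lt, V_tgt)
        logits_hal = self.translator(src_tokens, halluc_tokens)
        mt_loss = (F.cross_entropy(logits_gt.transpose(1, 2), tgt_tokens)
                   + F.cross_entropy(logits_hal.transpose(1, 2), tgt_tokens))

        # 3. Consistency loss: encourage the two translation distributions to agree
        #    (KL divergence chosen here for illustration).
        consistency = F.kl_div(
            F.log_softmax(logits_hal, dim=-1),
            F.softmax(logits_gt, dim=-1),
            reduction="batchmean")

        return mt_loss + halluc_loss + consistency
```

At inference time, only the first branch is needed: the hallucinator produces visual tokens from the source sentence alone, and the translator consumes the source text together with those hallucinated tokens, so no paired image is required.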