Paper Title
Multimodal Neural Machine Translation with Search Engine Based Image Retrieval
Paper Authors
Paper Abstract
Recently, a number of works have shown that the performance of neural machine translation (NMT) can be improved to a certain extent by using visual information. However, most of these conclusions are drawn from the analysis of experimental results based on a limited set of bilingual sentence-image pairs, such as Multi30K. In such datasets, the content of each bilingual parallel sentence pair must be well represented by a manually annotated image, which differs from the actual translation scenario. Some previous works have proposed to address this problem by retrieving images from existing sentence-image pairs with topic models. However, because of the limited collection of sentence-image pairs they used, their image retrieval methods struggle to handle out-of-vocabulary words, and can hardly prove that the visual information itself enhances NMT rather than the mere co-occurrence of images and sentences. In this paper, we propose an open-vocabulary image retrieval method that collects descriptive images for a bilingual parallel corpus using an image search engine. We then propose a text-aware attentive visual encoder to filter out incorrectly collected noisy images. Experimental results on Multi30K and two other translation datasets show that our proposed method achieves significant improvements over strong baselines.
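The abstract does not specify the internals of the text-aware attentive visual encoder. A minimal sketch of one plausible form, where the sentence representation attends over the features of the retrieved images so that images unrelated to the text receive low weight (all function names, projection matrices, and the single-query attention layout here are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def text_aware_visual_encoding(text_repr, image_feats, W_q, W_k):
    """Attend over retrieved-image features using the sentence as the query.

    text_repr:   (d_t,)   sentence representation
    image_feats: (n, d_v) features of the n retrieved images
    W_q:         (d, d_t) query projection (assumed, learned in practice)
    W_k:         (d, d_v) key projection (assumed, learned in practice)
    Returns the attention-weighted visual context and the per-image weights.
    """
    q = W_q @ text_repr                     # query derived from the text
    keys = image_feats @ W_k.T              # one key per retrieved image
    scores = keys @ q / np.sqrt(len(q))     # scaled dot-product relevance
    weights = softmax(scores)               # noisy images get low weight
    context = weights @ image_feats         # (d_v,) filtered visual context
    return context, weights
```

In this sketch, filtering of incorrectly retrieved images is soft: irrelevant images are down-weighted by the attention distribution rather than discarded outright, so the visual context stays differentiable end-to-end.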