Paper Title

Learning Dual Semantic Relations with Graph Attention for Image-Text Matching

Paper Authors

Keyu Wen, Xiaodong Gu, Qingrong Cheng

Paper Abstract

Image-text matching is a major task in cross-modal information processing. The main challenge is to learn unified visual and textual representations. Previous methods that perform well on this task focus not only on the alignment between region features in images and the corresponding words in sentences, but also on the alignment between relations among regions and relational words. However, the lack of joint learning of regional and global features causes regional features to lose contact with the global context, leading to mismatches with non-object words that carry global meaning in some sentences. To alleviate this issue, this work enhances both the relations between regions and the relations between regional and global concepts, yielding a more accurate visual representation that correlates better with the corresponding text. Thus, a novel multi-level semantic relations enhancement approach named Dual Semantic Relations Attention Network (DSRAN) is proposed, which mainly consists of two modules: a separate semantic relations module and a joint semantic relations module. DSRAN performs graph attention in both modules, for region-level relation enhancement and regional-global relation enhancement respectively. With these two modules, different hierarchies of semantic relations are learned simultaneously, promoting the image-text matching process by providing more information for the final visual representation. Quantitative experiments on MS-COCO and Flickr30K show that our method outperforms previous approaches by a large margin, owing to the effectiveness of the dual semantic relations learning scheme. Code is available at https://github.com/kywen1119/DSRAN.
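The sketch below is not the authors' implementation (see the repository above for that); it is a minimal, hypothetical PyTorch illustration of the two ideas the abstract describes: graph attention over region nodes to enhance region-level relations, and a second graph attention pass in which a global image feature is added as an extra node so regions stay connected to the global context. All dimensions, layer names, and the final pooling/fusion choices are assumptions for illustration only.

```python
# Hypothetical sketch of dual (region-region and region-global) relation
# enhancement with graph attention. Not the official DSRAN code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttention(nn.Module):
    """Single-head graph attention over a fully connected graph of nodes."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, nodes):
        # nodes: (batch, num_nodes, dim)
        q, k, v = self.query(nodes), self.key(nodes), self.value(nodes)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Residual connection keeps the original node semantics.
        return nodes + attn @ v


class DualRelationEncoder(nn.Module):
    """Region-level relations, then regional-global relations (assumed fusion)."""

    def __init__(self, dim=1024):
        super().__init__()
        self.region_gat = GraphAttention(dim)  # separate (region-region) relations
        self.joint_gat = GraphAttention(dim)   # joint (region-global) relations

    def forward(self, regions, global_feat):
        # regions: (batch, num_regions, dim), e.g. detected region features
        # global_feat: (batch, dim), e.g. a pooled global image feature
        regions = self.region_gat(regions)
        # Append the global feature as an extra node so regions can attend to it.
        joint = torch.cat([regions, global_feat.unsqueeze(1)], dim=1)
        joint = self.joint_gat(joint)
        # Pool to one visual embedding to match against a text embedding.
        return F.normalize(joint.mean(dim=1), dim=-1)


if __name__ == "__main__":
    enc = DualRelationEncoder(dim=1024)
    regions = torch.randn(2, 36, 1024)
    global_feat = torch.randn(2, 1024)
    print(enc(regions, global_feat).shape)  # torch.Size([2, 1024])
```

The resulting visual embedding would typically be compared with a sentence embedding via cosine similarity under a ranking objective, but that matching step is outside the scope of this sketch.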
