Paper Title

StacMR: Scene-Text Aware Cross-Modal Retrieval

Paper Authors

Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, Diane Larlus, Dimosthenis Karatzas

Paper Abstract

Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions, to name a few. This has resulted in improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches that leverage scene text, including a better scene-text aware cross-modal retrieval method that uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further. Dataset and code are available at http://europe.naverlabs.com/stacmr.
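The core idea in the abstract is to learn specialized representations for the caption text and for the scene text found inside the image, and to reconcile them with the visual features in a common embedding space. The sketch below shows one plausible way to wire this up in PyTorch; it is not the authors' implementation, and the module choices, feature dimensions, fusion by summation, and the hinge-based triplet ranking loss (a standard choice in cross-modal retrieval) are all illustrative assumptions.

```python
# Minimal sketch, assuming pre-extracted region features (e.g. from an object
# detector), OCR token embeddings for the scene text, and caption word
# embeddings. Dimensions and architecture are illustrative, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 1024  # assumed size of the common embedding space


class JointEmbedding(nn.Module):
    def __init__(self, visual_dim=2048, scene_text_dim=300, caption_dim=300):
        super().__init__()
        # Specialized projections for each input stream.
        self.visual_proj = nn.Linear(visual_dim, EMB_DIM)        # image regions
        self.scene_text_proj = nn.Linear(scene_text_dim, EMB_DIM)  # OCR tokens
        self.caption_encoder = nn.GRU(caption_dim, EMB_DIM, batch_first=True)

    def embed_image(self, region_feats, scene_text_feats):
        # Pool each stream, then fuse (here: sum) into one L2-normalized
        # image embedding in the common space.
        v = self.visual_proj(region_feats).mean(dim=1)
        t = self.scene_text_proj(scene_text_feats).mean(dim=1)
        return F.normalize(v + t, dim=-1)

    def embed_caption(self, caption_tokens):
        # Encode the caption and take the final GRU hidden state.
        _, h = self.caption_encoder(caption_tokens)
        return F.normalize(h.squeeze(0), dim=-1)


def triplet_ranking_loss(img_emb, cap_emb, margin=0.2):
    # Hinge-based triplet ranking loss over in-batch negatives, in both
    # image-to-caption and caption-to-image directions.
    scores = img_emb @ cap_emb.t()                    # cosine similarities
    pos = scores.diag().unsqueeze(1)                  # matching pairs
    cost_cap = (margin + scores - pos).clamp(min=0)   # rank captions per image
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # rank images per caption
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return (cost_cap.masked_fill(mask, 0).sum()
            + cost_img.masked_fill(mask, 0).sum())


# Usage with random stand-in features: 8 images with 36 regions and 5 OCR
# tokens each, and 8 captions of 12 words each.
model = JointEmbedding()
imgs = model.embed_image(torch.randn(8, 36, 2048), torch.randn(8, 5, 300))
caps = model.embed_caption(torch.randn(8, 12, 300))
loss = triplet_ranking_loss(imgs, caps)
```

At retrieval time, images and captions would be ranked by cosine similarity of their embeddings; since both are L2-normalized, the dot product in the loss is exactly that similarity.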
