Paper Title

StacMR: Scene-Text Aware Cross-Modal Retrieval

Paper Authors

Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, Diane Larlus, Dimosthenis Karatzas

Paper Abstract

Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions, to name a few. This has resulted in improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches that leverage scene text, including a better scene-text aware cross-modal retrieval method that uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further. Dataset and code are available at http://europe.naverlabs.com/stacmr.
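The core idea in the abstract is to learn specialized representations for the caption text and for the scene text found inside the image, and to reconcile them with the visual features in a common embedding space. The sketch below shows one plausible way to wire this up in PyTorch; it is not the authors' implementation, and the module choices, feature dimensions, fusion by summation, and the hinge-based triplet ranking loss (a standard choice in cross-modal retrieval) are all illustrative assumptions.

```python
# Minimal sketch, assuming pre-extracted region features (e.g. from an object
# detector), OCR token embeddings for the scene text, and caption word
# embeddings. Dimensions and architecture are illustrative, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 1024  # assumed size of the common embedding space


class JointEmbedding(nn.Module):
    def __init__(self, visual_dim=2048, scene_text_dim=300, caption_dim=300):
        super().__init__()
        # Specialized projections for each input stream.
        self.visual_proj = nn.Linear(visual_dim, EMB_DIM)        # image regions
        self.scene_text_proj = nn.Linear(scene_text_dim, EMB_DIM)  # OCR tokens
        self.caption_encoder = nn.GRU(caption_dim, EMB_DIM, batch_first=True)

    def embed_image(self, region_feats, scene_text_feats):
        # Pool each stream, then fuse (here: sum) into one L2-normalized
        # image embedding in the common space.
        v = self.visual_proj(region_feats).mean(dim=1)
        t = self.scene_text_proj(scene_text_feats).mean(dim=1)
        return F.normalize(v + t, dim=-1)

    def embed_caption(self, caption_tokens):
        # Encode the caption and take the final GRU hidden state.
        _, h = self.caption_encoder(caption_tokens)
        return F.normalize(h.squeeze(0), dim=-1)


def triplet_ranking_loss(img_emb, cap_emb, margin=0.2):
    # Hinge-based triplet ranking loss over in-batch negatives, in both
    # image-to-caption and caption-to-image directions.
    scores = img_emb @ cap_emb.t()                    # cosine similarities
    pos = scores.diag().unsqueeze(1)                  # matching pairs
    cost_cap = (margin + scores - pos).clamp(min=0)   # rank captions per image
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # rank images per caption
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return (cost_cap.masked_fill(mask, 0).sum()
            + cost_img.masked_fill(mask, 0).sum())


# Usage with random stand-in features: 8 images with 36 regions and 5 OCR
# tokens each, and 8 captions of 12 words each.
model = JointEmbedding()
imgs = model.embed_image(torch.randn(8, 36, 2048), torch.randn(8, 5, 300))
caps = model.embed_caption(torch.randn(8, 12, 300))
loss = triplet_ranking_loss(imgs, caps)
```

At retrieval time, images and captions would be ranked by cosine similarity of their embeddings; since both are L2-normalized, the dot product in the loss is exactly that similarity.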
