Paper Title

Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching

Paper Authors

Hengcan Shi, Munawar Hayat, Jianfei Cai

Paper Abstract

Referring expression grounding is an important and challenging task in computer vision. To avoid the laborious annotation required by conventional referring grounding, unpaired referring grounding has been introduced, where the training data contains only a number of images and queries without correspondences. The few existing solutions to unpaired referring grounding are still preliminary, due to the challenges of learning image-text matching and the lack of top-down guidance with unpaired data. In this paper, we propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges. In particular, we design a query-aware attention map (QAM) module that introduces a top-down perspective by generating query-specific visual attention maps. A cross-modal object matching (COM) module is further introduced, which exploits the recently emerged image-text matching pretrained model, CLIP, to predict target objects from a bottom-up perspective. The top-down and bottom-up predictions are then integrated via a similarity fusion (SF) module. We also propose a knowledge adaptation matching (KAM) module that leverages unpaired training data to adapt pretrained knowledge to the target dataset and task. Experiments show that our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
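To make the bottom-up matching and fusion ideas in the abstract concrete, the sketch below scores region proposals against a query with an off-the-shelf CLIP model and combines the result with a top-down attention score. This is a minimal sketch, not the paper's implementation: the function names, the Hugging Face `openai/clip-vit-base-patch32` checkpoint, and the weighted-sum fusion with `alpha` are assumptions made for illustration only.

```python
# Illustrative sketch of CLIP-based region-query scoring (bottom-up) and a
# simple similarity fusion with a top-down score. Not the authors' code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_region_scores(image: Image.Image, boxes, query: str) -> torch.Tensor:
    """Bottom-up matching: crop each proposal box and score it against the query."""
    crops = [image.crop(box) for box in boxes]  # each box is (x1, y1, x2, y2)
    inputs = processor(text=[query], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_boxes, 1): one CLIP similarity per crop.
    return out.logits_per_image.squeeze(-1).softmax(dim=0)

def fuse_scores(bottom_up: torch.Tensor, top_down: torch.Tensor, alpha: float = 0.5) -> int:
    """Illustrative fusion: weighted combination of the two views, return the best box index."""
    scores = alpha * bottom_up + (1 - alpha) * top_down
    return int(scores.argmax())
```

In this sketch, `top_down` would come from pooling a query-specific attention map over each proposal region, in the spirit of the QAM module; the weighted sum stands in for the paper's SF module, whose exact form is not described in the abstract.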
