Paper Title
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
Paper Authors
Paper Abstract
Visual grounding focuses on establishing fine-grained alignment between vision and natural language, which has essential applications in multimodal reasoning systems. Existing methods use pre-trained, query-agnostic visual backbones to extract visual feature maps independently, without considering the query information. We argue that the visual features extracted by such backbones are inconsistent with the features actually needed for multimodal reasoning. One reason is the difference between the pre-training tasks and visual grounding. Moreover, since the backbones are query-agnostic, the inconsistency cannot be fully resolved simply by training the visual backbone end-to-end within the visual grounding framework. In this paper, we propose a Query-modulated Refinement Network (QRNet) to address this inconsistency by adjusting intermediate features in the visual backbone with a novel Query-aware Dynamic Attention (QD-ATT) mechanism and query-aware multi-scale fusion. QD-ATT dynamically computes query-dependent visual attention at the spatial and channel levels of the feature maps produced by the visual backbone. We apply QRNet to an end-to-end visual grounding framework. Extensive experiments show that the proposed method outperforms state-of-the-art methods on five widely used datasets.
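The core idea of QD-ATT, as the abstract describes it, is to gate a backbone feature map with attention weights computed *from the query* at both the channel and spatial levels. The following is a minimal numpy sketch of that idea, not the paper's actual implementation: the projection matrices `Wc` and `Ws` and the two-stage gating order are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qd_att(feat, query, Wc, Ws):
    """Sketch of query-aware dynamic attention (hypothetical parameterization).

    feat  : (C, H, W) visual feature map from the backbone
    query : (D,)      pooled textual query embedding
    Wc    : (C, D)    projects the query to per-channel gates (assumed)
    Ws    : (C, D)    projects the query to a spatial scoring filter (assumed)
    """
    # Channel level: each channel is reweighted by a query-dependent gate.
    ch_att = sigmoid(Wc @ query)                  # (C,)
    feat = feat * ch_att[:, None, None]

    # Spatial level: each location is scored against a query-dependent filter.
    sp_filter = Ws @ query                        # (C,)
    sp_att = sigmoid(np.einsum('chw,c->hw', feat, sp_filter))  # (H, W)
    return feat * sp_att[None, :, :]

# Toy usage: shapes are preserved, so the refined map can replace the
# original intermediate feature inside the backbone.
feat = np.ones((4, 3, 3))
query = np.ones(5)
out = qd_att(feat, query, np.zeros((4, 5)), np.zeros((4, 5)))
print(out.shape)  # (4, 3, 3)
```

Because the output has the same shape as the input feature map, this refinement can be inserted between backbone stages without changing the rest of the grounding pipeline, which matches the abstract's claim of adjusting *intermediate* features.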