Paper Title
Correspondence Matters for Video Referring Expression Comprehension
Paper Authors
Paper Abstract
We investigate the problem of video Referring Expression Comprehension (REC), which aims to localize the referent objects described in the sentence to visual regions in the video frames. Despite recent progress, existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects. To this end, we propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners. Firstly, we aim to build inter-frame correlations for all existing instances within the frames. Specifically, we compute the inter-frame patch-wise cosine similarity to estimate the dense alignment and then perform inter-frame contrastive learning to map aligned patches close in the feature space. Secondly, we propose to build fine-grained patch-word alignment to associate each patch with certain words. Since such detailed annotations are not available, we also predict the patch-word correspondence through cosine similarity. Extensive experiments demonstrate that our DCNet achieves state-of-the-art performance on both video and image REC benchmarks. Furthermore, we conduct comprehensive ablation studies and thorough analyses to explore the optimal model designs. Notably, our inter-frame and cross-modal contrastive losses are plug-and-play and applicable to any video REC architecture. For example, by building on top of Co-grounding, we boost the performance by a 1.48% absolute improvement in Accu.@0.5 on the VID-Sentence dataset.
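To make the two objectives in the abstract concrete, the following PyTorch-style snippet is a minimal sketch, not the authors' released code: the function names, tensor shapes, the temperature value, and the argmax-based pseudo-labeling of correspondences are illustrative assumptions about how cosine-similarity alignment and contrastive learning could be combined.

import torch
import torch.nn.functional as F

def inter_frame_contrastive_loss(patches_a, patches_b, temperature=0.07):
    # patches_a, patches_b: (N, D) patch features from two frames (assumed shapes).
    # Estimate the dense alignment from patch-wise cosine similarity, then treat the
    # best-matching patch as the positive and all others as negatives (InfoNCE-style).
    a = F.normalize(patches_a, dim=-1)
    b = F.normalize(patches_b, dim=-1)
    sim = a @ b.t() / temperature        # (N, N) cosine similarities as logits
    targets = sim.argmax(dim=-1)         # pseudo-alignment predicted from similarity
    return F.cross_entropy(sim, targets)

def patch_word_contrastive_loss(patches, words, temperature=0.07):
    # patches: (N, D) visual patch features, words: (L, D) word features (assumed shapes).
    # With no patch-word annotations available, the correspondence is again predicted
    # from cosine similarity and used as a pseudo-label for the contrastive objective.
    p = F.normalize(patches, dim=-1)
    w = F.normalize(words, dim=-1)
    sim = p @ w.t() / temperature        # (N, L)
    targets = sim.argmax(dim=-1)
    return F.cross_entropy(sim, targets)

# Toy usage with random features (64 patches per frame, 12 words, 256-d embeddings)
frame_a = torch.randn(64, 256)
frame_b = torch.randn(64, 256)
words = torch.randn(12, 256)
loss = inter_frame_contrastive_loss(frame_a, frame_b) + patch_word_contrastive_loss(frame_a, words)

Because both losses consume only generic patch and word features, a sketch like this could in principle be attached to any video REC backbone, which is consistent with the abstract's claim that the two contrastive losses are plug-and-play.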