Paper Title

Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding

Paper Authors

Yuechen Wang, Wengang Zhou, Houqiang Li

Paper Abstract

Temporal language grounding (TLG) aims to localize a video segment in an untrimmed video based on a natural language description. To alleviate the expensive cost of manual annotation for temporal boundary labels, we are dedicated to the weakly supervised setting, where only video-level descriptions are provided for training. Most existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework. However, the temporal structure of the video as well as the complicated semantics in the sentence are lost during the learning. In this work, we propose a novel candidate-free framework, the Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG. Instead of viewing the sentence and candidate moments as a whole, FSAN learns token-by-clip cross-modal semantic alignment with an iterative cross-modal interaction module, generates a fine-grained cross-modal semantic alignment map, and performs grounding directly on top of the map. Extensive experiments are conducted on two widely used benchmarks, ActivityNet-Captions and DiDeMo, where our FSAN achieves state-of-the-art performance.
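To make the abstract's core idea concrete, the sketch below illustrates a token-by-clip alignment map and grounding directly on top of it. This is not the authors' implementation: the cosine-similarity scoring, the tensor shapes, and the thresholding heuristic are illustrative assumptions; FSAN's iterative cross-modal interaction module and grounding head are more elaborate.

```python
import torch
import torch.nn.functional as F


def token_by_clip_alignment_map(word_feats, clip_feats):
    """Cosine similarity between every sentence token and every video clip.

    word_feats: (num_tokens, dim) token embeddings (illustrative).
    clip_feats: (num_clips, dim) clip features (illustrative).
    Returns a (num_tokens, num_clips) alignment map with scores in [-1, 1].
    """
    w = F.normalize(word_feats, dim=-1)
    c = F.normalize(clip_feats, dim=-1)
    return w @ c.t()


def ground_on_map(align_map):
    """Toy candidate-free grounding: average token scores per clip, then
    return the longest contiguous run of clips scoring above the mean."""
    clip_scores = align_map.mean(dim=0)              # (num_clips,)
    above = (clip_scores > clip_scores.mean()).tolist()
    best, start, length, best_len = (0, 0), None, 0, 0
    for i, flag in enumerate(above):
        if flag:
            start = i if start is None else start
            length += 1
            if length > best_len:
                best_len, best = length, (start, i)
        else:
            start, length = None, 0
    return best  # (start_clip_index, end_clip_index)


# Example with random features: 12 tokens, 64 clips, 256-dim embeddings.
words = torch.randn(12, 256)
clips = torch.randn(64, 256)
print(ground_on_map(token_by_clip_alignment_map(words, clips)))
```

In this toy version the segment is read off the map with a simple threshold; the point is only that, once a fine-grained token-by-clip map exists, grounding can be performed on the map itself without enumerating candidate segments.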
