Paper Title

Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding

Paper Authors

Yuechen Wang, Wengang Zhou, Houqiang Li

Paper Abstract

Temporal language grounding (TLG) aims to localize a video segment in an untrimmed video based on a natural language description. To alleviate the expensive cost of manual annotation for temporal boundary labels, we are dedicated to the weakly supervised setting, where only video-level descriptions are provided for training. Most existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework. However, the temporal structure of the video as well as the complicated semantics in the sentence are lost during the learning. In this work, we propose a novel candidate-free framework, the Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG. Instead of viewing the sentence and candidate moments as a whole, FSAN learns token-by-clip cross-modal semantic alignment with an iterative cross-modal interaction module, generates a fine-grained cross-modal semantic alignment map, and performs grounding directly on top of the map. Extensive experiments are conducted on two widely used benchmarks, ActivityNet-Captions and DiDeMo, where our FSAN achieves state-of-the-art performance.
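To make the abstract's core idea concrete, the sketch below illustrates a token-by-clip alignment map and grounding directly on top of it. This is not the authors' implementation: the cosine-similarity scoring, the tensor shapes, and the thresholding heuristic are illustrative assumptions; FSAN's iterative cross-modal interaction module and grounding head are more elaborate.

```python
import torch
import torch.nn.functional as F


def token_by_clip_alignment_map(word_feats, clip_feats):
    """Cosine similarity between every sentence token and every video clip.

    word_feats: (num_tokens, dim) token embeddings (illustrative).
    clip_feats: (num_clips, dim) clip features (illustrative).
    Returns a (num_tokens, num_clips) alignment map with scores in [-1, 1].
    """
    w = F.normalize(word_feats, dim=-1)
    c = F.normalize(clip_feats, dim=-1)
    return w @ c.t()


def ground_on_map(align_map):
    """Toy candidate-free grounding: average token scores per clip, then
    return the longest contiguous run of clips scoring above the mean."""
    clip_scores = align_map.mean(dim=0)              # (num_clips,)
    above = (clip_scores > clip_scores.mean()).tolist()
    best, start, length, best_len = (0, 0), None, 0, 0
    for i, flag in enumerate(above):
        if flag:
            start = i if start is None else start
            length += 1
            if length > best_len:
                best_len, best = length, (start, i)
        else:
            start, length = None, 0
    return best  # (start_clip_index, end_clip_index)


# Example with random features: 12 tokens, 64 clips, 256-dim embeddings.
words = torch.randn(12, 256)
clips = torch.randn(64, 256)
print(ground_on_map(token_by_clip_alignment_map(words, clips)))
```

In this toy version the segment is read off the map with a simple threshold; the point is only that, once a fine-grained token-by-clip map exists, grounding can be performed on the map itself without enumerating candidate segments.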
