Paper Title
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation
Paper Authors
Paper Abstract
Text-based video segmentation aims to segment the target object in a video according to a describing sentence. Incorporating motion information from optical flow maps with appearance and linguistic modalities is crucial, yet it has been largely ignored by previous work. In this paper, we design a method to fuse and align appearance, motion, and linguistic features for accurate segmentation. Specifically, we propose a multi-modal video transformer that fuses and aggregates multi-modal and temporal features across frames. Furthermore, we design a language-guided feature fusion module to progressively fuse appearance and motion features at each feature level under the guidance of linguistic features. Finally, a multi-modal alignment loss is proposed to alleviate the semantic gap between features from different modalities. Extensive experiments on A2D Sentences and J-HMDB Sentences verify the performance and generalization ability of our method compared with state-of-the-art methods.
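
To make the two proposed components concrete, below is a minimal PyTorch-style sketch of (a) language-guided fusion of appearance and motion features and (b) a multi-modal alignment loss. The module shapes, the sigmoid gating design, and the cosine-similarity loss are illustrative assumptions for exposition only, not the paper's actual formulation.

    # Illustrative sketch only; gating design and loss choice are assumptions,
    # not the paper's exact method.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LanguageGuidedFusion(nn.Module):
        """Fuse appearance and motion features under linguistic guidance
        (hypothetical design: a sentence-conditioned channel gate)."""
        def __init__(self, feat_dim: int, lang_dim: int):
            super().__init__()
            # Project the sentence embedding to a per-channel gate in [0, 1].
            self.gate = nn.Sequential(nn.Linear(lang_dim, feat_dim), nn.Sigmoid())
            self.out = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1)

        def forward(self, appearance, motion, lang):
            # appearance, motion: (B, C, H, W); lang: (B, L)
            g = self.gate(lang)[:, :, None, None]        # (B, C, 1, 1)
            # Let the sentence decide how much motion to mix into appearance.
            fused = torch.cat([appearance, g * motion], dim=1)
            return self.out(fused)

    def multimodal_alignment_loss(visual, lang):
        """Pull visual and linguistic embeddings together. A cosine-similarity
        loss is one plausible choice; the paper's loss may differ."""
        visual = F.normalize(visual, dim=-1)
        lang = F.normalize(lang, dim=-1)
        return (1.0 - (visual * lang).sum(dim=-1)).mean()

    # Toy usage with random tensors standing in for real features.
    fuse = LanguageGuidedFusion(feat_dim=256, lang_dim=512)
    app = torch.randn(2, 256, 32, 32)   # appearance features
    mot = torch.randn(2, 256, 32, 32)   # optical-flow (motion) features
    txt = torch.randn(2, 512)           # pooled sentence embedding
    out = fuse(app, mot, txt)           # (2, 256, 32, 32)
    vis_emb = out.mean(dim=(2, 3))      # pooled visual embedding, (2, 256)
    txt_emb = fuse.gate[0](txt)         # language projected to feat_dim
    loss = multimodal_alignment_loss(vis_emb, txt_emb)

Here the gate lets the sentence embedding modulate how strongly motion information is mixed into the appearance stream, which mirrors the role of the language-guided feature fusion module described above; the module's actual architecture and the alignment loss should be taken from the paper itself.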