Paper Title
Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation
Paper Authors
Abstract
Referring video segmentation aims to segment the video object described by a language expression. To address this task, we first design a two-stream encoder that hierarchically extracts CNN-based visual features and transformer-based linguistic features, and a vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features. Compared with existing multi-modal fusion methods, this two-stream encoder takes the multi-granularity linguistic context into account and, with the help of VLMG, realizes deep interleaving between the modalities. To promote temporal alignment between frames, we further propose a language-guided multi-scale dynamic filtering (LMDF) module that strengthens temporal coherence: it uses language-guided spatio-temporal features to generate a set of position-specific dynamic filters that update the features of the current frame more flexibly and effectively. Extensive experiments on four datasets verify the effectiveness of the proposed model.
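The position-specific dynamic filtering at the heart of LMDF can be illustrated with a minimal NumPy sketch. Everything here is an illustrative assumption rather than the paper's actual implementation: features are single-channel, and the per-position k×k filters (which the paper predicts from language-guided spatio-temporal features) are simply passed in as an array. The key point is that, unlike a standard convolution with one shared kernel, each spatial location applies its own filter to the local patch from the previous frame's features.

```python
import numpy as np

def dynamic_filtering(prev_feat, filters, k=3):
    """Apply position-specific k x k dynamic filters to a feature map.

    prev_feat: (H, W) single-channel feature map (e.g. from the previous frame).
    filters:   (H, W, k*k) one flattened k x k filter per spatial position,
               assumed to be predicted elsewhere (in LMDF, from
               language-guided spatio-temporal features).
    Returns an updated (H, W) feature map.
    """
    H, W = prev_feat.shape
    pad = k // 2
    # Edge-pad so every position has a full k x k neighborhood.
    padded = np.pad(prev_feat, pad, mode="edge")
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + k, j:j + k].ravel()
            # Each location uses its own filter, unlike a shared conv kernel.
            out[i, j] = patch @ filters[i, j]
    return out
```

A quick sanity check: if every position's filter is a delta (center weight 1, all others 0), the operation leaves the feature map unchanged.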