论文标题
拥抱一致性:时空视频接地的一个阶段方法
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
论文作者
论文摘要
时空视频接地(STVG)的重点是检索由自由形式的文本表达式描绘的特定对象的时空管。现有方法主要将这一复杂的任务视为平行框架的问题,因此遭受了两种类型的不一致缺点:特征对齐不一致和预测不一致。在本文中,我们提出了一个端到端的一阶段框架,称为时空一致性 - 感知变压器(STCAT),以减轻这些问题。特别是,我们引入了一个新颖的多模式模板,作为解决此任务的全球目标,该目标明确限制了基础区域并将预测相关联。此外,要在足够的视频文本感知下生成上述模板,提出了一个编码器架构来进行有效的全局上下文建模。由于这些关键设计,STCAT享有更一致的跨模式特征对齐和管预测,而无需依赖任何预训练的对象探测器。广泛的实验表明,我们的方法在两个具有挑战性的视频基准(VIDSTG和HC-STVG)上胜过以前的最先进的最先进,这说明了拟议框架的优越性,以更好地理解视力与自然语言之间的关联。代码可在https://github.com/jy0205/stcat上公开获取。
Spatio-Temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression. Existing approaches mainly treat this complicated task as a parallel frame-grounding problem and thus suffer from two types of inconsistency drawbacks: feature alignment inconsistency and prediction inconsistency. In this paper, we present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT), to alleviate these issues. Specially, we introduce a novel multi-modal template as the global objective to address this task, which explicitly constricts the grounding region and associates the predictions among all video frames. Moreover, to generate the above template under sufficient video-textual perception, an encoder-decoder architecture is proposed for effective global context modeling. Thanks to these critical designs, STCAT enjoys more consistent cross-modal feature alignment and tube prediction without reliance on any pre-trained object detectors. Extensive experiments show that our method outperforms previous state-of-the-arts with clear margins on two challenging video benchmarks (VidSTG and HC-STVG), illustrating the superiority of the proposed framework to better understanding the association between vision and natural language. Code is publicly available at https://github.com/jy0205/STCAT.