Paper Title

STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding

Authors

Zihang Lin, Chaolei Tan, Jian-Fang Hu, Zhi Jin, Tiancai Ye, Wei-Shi Zheng

Abstract

In this technical report, we introduce our solution to the human-centric spatio-temporal video grounding task. We propose a concise and effective framework named STVGFormer, which models spatio-temporal visual-linguistic dependencies with a static branch and a dynamic branch. The static branch performs cross-modal understanding within a single frame and learns to localize the target object spatially according to intra-frame visual cues such as object appearance. The dynamic branch performs cross-modal understanding across multiple frames. It learns to predict the starting and ending times of the target moment according to dynamic visual cues such as motions. Both the static and dynamic branches are designed as cross-modal transformers. We further design a novel static-dynamic interaction block that enables the static and dynamic branches to transfer useful and complementary information to each other, which is shown to be effective in improving predictions on hard cases. Our proposed method achieved 39.6% vIoU and won first place in the HC-STVG track of the 4th Person in Context Challenge.
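The following is a minimal PyTorch-style sketch of the two-branch design described in the abstract, intended only to illustrate how a static branch, a dynamic branch, and a static-dynamic interaction block might be wired together. All module names, token shapes, layer counts, and prediction heads here (`StaticDynamicInteraction`, `TwoBranchGrounding`, `box_head`, `time_head`) are assumptions for illustration; they are not taken from the paper.

```python
# Illustrative sketch only; not the authors' implementation of STVGFormer.
import torch
import torch.nn as nn


class StaticDynamicInteraction(nn.Module):
    """Hypothetical interaction block: each branch cross-attends to the other."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.dynamic_to_static = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.static_to_dynamic = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, static_feat: torch.Tensor, dynamic_feat: torch.Tensor):
        # The static branch queries dynamic features and vice versa, so the two
        # branches can exchange complementary cues; residual connections keep
        # each branch's own information intact.
        s_out, _ = self.dynamic_to_static(static_feat, dynamic_feat, dynamic_feat)
        d_out, _ = self.static_to_dynamic(dynamic_feat, static_feat, static_feat)
        return static_feat + s_out, dynamic_feat + d_out


class TwoBranchGrounding(nn.Module):
    """Hypothetical two-branch model: a static branch over per-frame tokens for
    spatial grounding and a dynamic branch over cross-frame tokens for temporal
    grounding, connected by the interaction block above."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.static_branch = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.dynamic_branch = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.interaction = StaticDynamicInteraction(dim)
        self.box_head = nn.Linear(dim, 4)   # per-frame box (cx, cy, w, h)
        self.time_head = nn.Linear(dim, 2)  # per-frame start/end scores

    def forward(self, frame_tokens: torch.Tensor, video_tokens: torch.Tensor):
        # frame_tokens: (B, N, dim) fused visual+text tokens of a single frame
        # video_tokens: (B, T, dim) fused visual+text tokens across T frames
        s = self.static_branch(frame_tokens)
        d = self.dynamic_branch(video_tokens)
        s, d = self.interaction(s, d)
        box = self.box_head(s[:, 0])   # spatial prediction from one query token
        times = self.time_head(d)      # temporal start/end scores for each frame
        return box, times


if __name__ == "__main__":
    model = TwoBranchGrounding()
    box, times = model(torch.randn(2, 16, 256), torch.randn(2, 32, 256))
    print(box.shape, times.shape)  # torch.Size([2, 4]) torch.Size([2, 32, 2])
```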
