Paper Title

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

Paper Authors

Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan

Paper Abstract

This paper tackles the emerging and challenging problem of long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are also in high demand but less explored, bringing new challenges of higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding-window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency: the query-guided window selection mechanism accelerates inference by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results. Code has been released at https://github.com/houzhijian/CONE.
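
The two-stage idea described above (a cheap query-guided pass that keeps only the most promising sliding windows, followed by fine-grained moment ranking inside them) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not CONE's actual implementation: the function names (`cosine_scores`, `select_windows`, `rerank_proposals`), the 256-d features, and the top-k value are all hypothetical, and a real system would use the paper's trained multi-modal encoders rather than random vectors.

```python
# Minimal sketch of a coarse-to-fine pipeline over sliding windows,
# assuming pre-computed query / window / proposal embeddings.
# All names, dimensions, and constants here are illustrative.
import numpy as np

def cosine_scores(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and N candidate vectors."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q

def select_windows(query_feat, window_feats, top_k=5):
    """Coarse stage: rank all sliding windows by query similarity and keep
    only the top-k, so the expensive fine-grained model runs on few windows."""
    order = np.argsort(-cosine_scores(query_feat, window_feats))
    return order[:top_k]

def rerank_proposals(query_feat, proposal_feats, proposal_times):
    """Fine stage: re-rank candidate moments from the selected windows by
    query-proposal similarity; return (start, end) spans, best first."""
    order = np.argsort(-cosine_scores(query_feat, proposal_feats))
    return [proposal_times[i] for i in order]

# Toy usage with random features: 100 windows, 256-d embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=256)
windows = rng.normal(size=(100, 256))
kept = select_windows(query, windows, top_k=5)
# Pretend each kept window yields 4 candidate moments with known spans.
proposals = rng.normal(size=(len(kept) * 4, 256))
spans = [(i * 10.0, i * 10.0 + 30.0) for i in range(len(proposals))]
print(rerank_proposals(query, proposals, spans)[:3])  # top-3 predicted moments
```

The efficiency claim in the abstract follows from this structure: the fine-grained model only sees the handful of windows surviving the coarse pass, rather than every window of a multi-hour video.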
