Paper Title
Video Moment Retrieval via Natural Language Queries
Paper Authors
Paper Abstract
In this paper, we propose a novel method for video moment retrieval (VMR) that achieves state-of-the-art (SOTA) performance on R@1 metrics and surpasses the SOTA on the high-IoU metric (R@1, IoU=0.7). First, we propose to use a multi-head self-attention mechanism, and further a cross-attention scheme, to capture video/query interactions and long-range dependencies from the video context. The attention-based design models frame-to-query and query-to-frame interactions at arbitrary positions, and the multi-head setting ensures sufficient modeling of complicated dependencies. Our model has a simple architecture, which enables faster training and inference while maintaining performance. Second, we propose a multi-task training objective consisting of a moment segmentation task, start/end distribution prediction, and start/end location regression. We have verified that start/end predictions are noisy due to annotator disagreement, and that joint training with the moment segmentation task provides richer information, since frames inside the target clip are also utilized as positive training examples. Third, we propose an early fusion approach, which achieves better performance at the cost of increased inference time. However, inference time is not a concern for our model, since its simple architecture enables efficient training and inference.
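The cross-attention scheme described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the projection matrices are randomly initialized stand-ins for learned weights, and all shapes and names are assumptions.

```python
import numpy as np

def multi_head_cross_attention(query_feats, video_feats, num_heads, rng):
    """Sketch: query tokens attend to video frames across multiple heads.

    query_feats: (Lq, d) query-token features; video_feats: (Lv, d) frame features.
    Projection weights are randomly drawn here for illustration only; in a real
    model they would be learned parameters.
    """
    d = query_feats.shape[1]
    assert d % num_heads == 0
    dh = d // num_heads
    # Hypothetical projection matrices (learned in an actual model).
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q = query_feats @ Wq   # (Lq, d)
    K = video_feats @ Wk   # (Lv, d)
    V = video_feats @ Wv   # (Lv, d)
    heads = []
    for h in range(num_heads):
        q = Q[:, h * dh:(h + 1) * dh]        # (Lq, dh)
        k = K[:, h * dh:(h + 1) * dh]        # (Lv, dh)
        v = V[:, h * dh:(h + 1) * dh]        # (Lv, dh)
        # Every query position scores every frame: arbitrary-position interaction.
        scores = q @ k.T / np.sqrt(dh)       # (Lq, Lv)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over frames
        heads.append(weights @ v)            # (Lq, dh)
    return np.concatenate(heads, axis=-1)    # (Lq, d)
```

Swapping the roles of `query_feats` and `video_feats` gives the symmetric frame-to-query direction; each head attends over the full sequence, which is what allows interactions at arbitrary positions.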
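The multi-task objective can be sketched as a weighted sum of the three losses named in the abstract. The loss weights `w_*`, the normalized-boundary regression target, and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def multi_task_loss(frame_logits, start_logits, end_logits,
                    pred_bounds, gt_start, gt_end, num_frames,
                    w_seg=1.0, w_dist=1.0, w_reg=1.0):
    """Sketch of a three-part VMR objective (weights are illustrative).

    frame_logits: (T,) per-frame foreground logits (moment segmentation);
    start_logits/end_logits: (T,) logits over frame indices;
    pred_bounds: (2,) predicted normalized (start, end);
    gt_start/gt_end: annotated start/end frame indices.
    """
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # Moment segmentation: every frame inside [gt_start, gt_end] is a positive,
    # so the target clip's interior frames also supervise the model.
    seg_target = np.zeros(num_frames)
    seg_target[gt_start:gt_end + 1] = 1.0
    p = 1.0 / (1.0 + np.exp(-frame_logits))
    seg_loss = -np.mean(seg_target * np.log(p + 1e-9)
                        + (1 - seg_target) * np.log(1 - p + 1e-9))

    # Start/end distribution prediction: cross-entropy at the annotated indices.
    dist_loss = (-np.log(softmax(start_logits)[gt_start] + 1e-9)
                 - np.log(softmax(end_logits)[gt_end] + 1e-9))

    # Start/end location regression: L1 on normalized boundaries.
    gt_norm = np.array([gt_start, gt_end]) / num_frames
    reg_loss = np.abs(pred_bounds - gt_norm).sum()

    return w_seg * seg_loss + w_dist * dist_loss + w_reg * reg_loss
```

The segmentation term is what makes the supervision denser: noisy start/end annotations affect only two indices, while every in-clip frame contributes a positive example.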
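The early-fusion trade-off can be illustrated with a minimal sketch, assuming fusion by concatenating a pooled query vector onto every frame feature before the joint encoder; the fusion operator and shapes are assumptions for illustration.

```python
import numpy as np

def early_fusion(frame_feats, query_vec):
    """Early fusion sketch: the pooled query vector is attached to every frame
    before encoding, so all downstream layers see query-conditioned input.
    The cost: the fused encoder must be rerun for every (video, query) pair at
    inference, rather than reusing query-independent video encodings.

    frame_feats: (Lv, dv) frame features; query_vec: (dq,) pooled query vector.
    Returns: (Lv, dv + dq) fused features.
    """
    Lv = frame_feats.shape[0]
    tiled = np.tile(query_vec, (Lv, 1))                   # (Lv, dq)
    return np.concatenate([frame_feats, tiled], axis=-1)  # (Lv, dv + dq)
```

Late fusion would instead encode video and query separately and combine them only at the matching stage, which is cheaper per query but gives the encoder no chance to condition on the query.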