Paper Title
Learning to Locate Visual Answer in Video Corpus Using Question
Paper Authors
Paper Abstract
We introduce a new task, named video corpus visual answer localization (VCVAL), which aims to locate the visual answer in a large collection of untrimmed instructional videos using a natural language question. This task requires a range of skills: interaction between vision and language, video retrieval, passage comprehension, and visual answer localization. In this paper, we propose a cross-modal contrastive global-span (CCGS) method for VCVAL, jointly training the video corpus retrieval and visual answer localization subtasks with a global-span matrix. We have reconstructed a dataset named MedVidCQA, on which the VCVAL task is benchmarked. Experimental results show that the proposed method outperforms other competitive methods in both the video corpus retrieval and visual answer localization subtasks. Most importantly, we provide detailed analyses of extensive experiments, paving a new path toward understanding instructional videos and opening avenues for further research.
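To give a sense of the span-localization component, the following is a minimal sketch of the global-span idea described in the abstract: per-frame start and end logits over the (concatenated) corpus are combined into a span matrix, and the highest-scoring valid span is selected. The function name and the additive scoring rule are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def best_global_span(start_logits, end_logits):
    """Score every candidate span (i, j) with i <= j as
    start_logits[i] + end_logits[j] (a hypothetical simplification of
    the global-span matrix) and return the best span with its score."""
    n = len(start_logits)
    # Span matrix: span[i, j] = start score at frame i + end score at frame j.
    span = start_logits[:, None] + end_logits[None, :]
    # Mask out invalid spans where the end precedes the start.
    span[np.tril_indices(n, k=-1)] = -np.inf
    i, j = np.unravel_index(np.argmax(span), span.shape)
    return int(i), int(j), float(span[i, j])

# Toy example: 6 frames across the corpus.
start = np.array([0.1, 2.0, 0.3, 0.0, 0.5, 0.2])
end = np.array([0.0, 0.1, 0.4, 1.5, 0.2, 0.1])
print(best_global_span(start, end))  # -> (1, 3, 3.5)
```

In the full method, such span scores would be computed jointly across all retrieved videos, so that selecting the best global span simultaneously picks the video and the answer segment within it.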