Title
Dense but Efficient VideoQA for Intricate Compositional Reasoning
Authors
Abstract
It is well known that most conventional video question answering (VideoQA) datasets consist of easy questions requiring simple reasoning processes. However, long videos inevitably contain complex, compositional semantic structures along the spatio-temporal axis, which requires a model to understand the compositional structure inherent in the videos. In this paper, we propose a new compositional VideoQA method based on a transformer architecture with a deformable attention mechanism to address complex VideoQA tasks. The deformable attention is introduced to sample a subset of informative visual features from the dense visual feature map, so as to cover a temporally long range of frames efficiently. Furthermore, the dependency structure of the complex question sentences is combined with the language embeddings to better capture the relations among question words. Extensive experiments and ablation studies show that the proposed dense but efficient model outperforms other baselines.
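The core sampling idea behind deformable attention — predicting a few offsets from the query, gathering only those features from the dense map, and combining them with query-dependent weights — can be sketched as follows. This is a minimal single-query, 1-D temporal illustration, not the paper's implementation: the projection matrices `W_off` and `W_attn`, the reference point, and the offset scaling are all hypothetical stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def deformable_attention(query, feature_map, num_points=4):
    """Sample a small set of frames from a dense temporal feature map at
    query-predicted offsets, then combine them with attention weights.
    All projections are random stand-ins for learned parameters."""
    T, d = feature_map.shape
    W_off = rng.standard_normal((d, num_points))   # hypothetical offset predictor
    W_attn = rng.standard_normal((d, num_points))  # hypothetical weight predictor
    ref = T // 2                                   # reference frame index
    # Predict bounded fractional offsets around the reference point.
    offsets = np.tanh(query @ W_off) * (T / 2)
    idx = np.clip(np.round(ref + offsets).astype(int), 0, T - 1)
    sampled = feature_map[idx]                     # (num_points, d) instead of (T, d)
    # Softmax attention weights over the sampled points only.
    logits = query @ W_attn
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ sampled                             # (d,) aggregated feature

T, d = 64, 16
feats = rng.standard_normal((T, d))  # dense per-frame visual features
q = rng.standard_normal(d)           # one query vector
out = deformable_attention(q, feats)
```

The efficiency gain comes from attending over `num_points` sampled frames rather than all `T` frames, so cost grows with the number of sampling points instead of the sequence length.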