Paper Title
Equivariant and Invariant Grounding for Video Question Answering
Paper Authors
Paper Abstract
Video Question Answering (VideoQA) is the task of answering natural language questions about a video. Producing an answer requires understanding the interplay between the visual scenes in the video and the linguistic semantics of the question. However, most leading VideoQA models work as black boxes, which makes the visual-linguistic alignment behind the answering process obscure. Such a black-box nature calls for visual explainability that reveals ``What part of the video should the model look at to answer the question?''. Only a few works present visual explanations, and they do so in a post-hoc fashion, emulating the target model's answering process via an additional method. Nonetheless, the emulation struggles to faithfully exhibit the visual-linguistic alignment during answering. Instead of post-hoc explainability, we focus on intrinsic interpretability to make the answering process transparent. At its core is grounding the question-critical cues as the causal scene to yield answers, while ruling out the question-irrelevant information as the environment scene. Taking a causal look at VideoQA, we devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV). Specifically, the equivariant grounding encourages the answering process to be sensitive to semantic changes in the causal scene and the question; in contrast, the invariant grounding enforces the answering process to be insensitive to changes in the environment scene. By imposing both on the answering process, EIGV distinguishes the causal scene from the environment information and explicitly presents the visual-linguistic alignment. Extensive experiments on three benchmark datasets demonstrate the superiority of EIGV over leading baselines in terms of both accuracy and visual interpretability.
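To make the two grounding objectives concrete, below is a minimal PyTorch sketch of how equivariant and invariant grounding could be imposed as training losses. The module names (Grounder, Answerer), the soft-attention scene split, and the scene/question swapping strategy are illustrative assumptions for this sketch, not the authors' released implementation.

```python
# A minimal sketch of EIGV-style training objectives, assuming a PyTorch
# setup with precomputed clip and question features. All names and the
# mixing strategy below are illustrative, not the official EIGV code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Grounder(nn.Module):
    """Scores each video clip's relevance to the question (soft grounding)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, clips, question):
        # clips: (B, T, D), question: (B, D) -> per-clip attention in [0, 1]
        q = question.unsqueeze(1).expand(-1, clips.size(1), -1)
        return torch.sigmoid(self.score(torch.cat([clips, q], -1))).squeeze(-1)

class Answerer(nn.Module):
    """Predicts answer logits from (grounded) video clips and the question."""
    def __init__(self, dim, num_answers):
        super().__init__()
        self.head = nn.Linear(2 * dim, num_answers)

    def forward(self, clips, question):
        video = clips.mean(dim=1)  # pool clips into one video feature
        return self.head(torch.cat([video, question], -1))

def eigv_losses(clips, question, labels, grounder, answerer):
    B = clips.size(0)
    attn = grounder(clips, question)               # (B, T) grounding mask
    causal = attn.unsqueeze(-1) * clips            # question-critical scene
    environ = (1 - attn).unsqueeze(-1) * clips     # question-irrelevant scene

    perm = torch.randperm(B)  # pair each sample with another in the batch

    # Invariant grounding: swapping in another sample's environment scene
    # should leave the answer unchanged, so supervise with the original label.
    loss_inv = F.cross_entropy(answerer(causal + environ[perm], question), labels)

    # Equivariant grounding: swapping in another sample's causal scene AND
    # question should change the answer accordingly, i.e., to that sample's label.
    loss_equi = F.cross_entropy(
        answerer(causal[perm] + environ, question[perm]), labels[perm])

    # Standard answering loss on the intact input.
    loss_qa = F.cross_entropy(answerer(clips, question), labels)
    return loss_qa + loss_inv + loss_equi

# Toy usage with random features: B=4 samples, T=8 clips, D=16 dims, 10 answers.
clips, question = torch.randn(4, 8, 16), torch.randn(4, 16)
labels = torch.randint(0, 10, (4,))
loss = eigv_losses(clips, question, labels, Grounder(16), Answerer(16, 10))
loss.backward()
```

In this reading, the invariant loss makes the prediction robust to environment substitution, while the equivariant loss forces the prediction to follow the swapped causal scene and question; together they pressure the grounder to isolate the truly question-critical clips, which is the visual-linguistic alignment the abstract aims to expose.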