Paper Title

Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

Authors

Ana Marasović, Chandra Bhagavatula, Jae Sung Park, Ronan Le Bras, Noah A. Smith, Yejin Choi

Abstract

Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights. We present the first study focused on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. The key challenge of accurate rationalization is comprehensive image understanding at all levels: not just their explicit content at the pixel level, but their contextual contents at the semantic and pragmatic levels. We present Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.
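The abstract describes conditioning a pretrained language model on visual inputs such as object-recognition features. A minimal sketch of one common way to do this, assuming detector features are linearly projected into the language model's embedding space and prepended to the token embeddings before decoding (the dimensions, variable names, and projection setup here are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

# Hypothetical sizes: d_visual mimics a region detector's feature
# dimension, d_model a GPT-2-style embedding dimension.
d_visual, d_model = 2048, 768

rng = np.random.default_rng(0)

# Assumed learned projection mapping visual features into the LM's
# embedding space (randomly initialized here for illustration).
W_proj = rng.normal(scale=0.02, size=(d_visual, d_model))

def project_visual(features, W):
    """Map detector features of shape (n_regions, d_visual)
    into the LM embedding space, shape (n_regions, d_model)."""
    return features @ W

# Toy inputs: 5 detected image regions and a 7-token textual prompt.
region_feats = rng.normal(size=(5, d_visual))
token_embeds = rng.normal(size=(7, d_model))

# Prepend projected visual embeddings to the token embeddings; a
# decoder would then generate the rationale conditioned on both.
visual_embeds = project_visual(region_feats, W_proj)
decoder_inputs = np.concatenate([visual_embeds, token_embeds], axis=0)
print(decoder_inputs.shape)  # (12, 768)
```

The same pattern extends to the other visual inputs the abstract names (semantic frames, commonsense-graph embeddings): each source gets its own projection into the shared embedding space before concatenation.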
