Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer

Paper title

Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer

Authors

Gi-Cheon Kang, Junseok Park, Hwaran Lee, Byoung-Tak Zhang, Jin-Hwa Kim

Abstract

Visual dialog is a task of answering a sequence of questions grounded in an image using the previous dialog history as context. In this paper, we study how to address two fundamental challenges for this task: (1) reasoning over underlying semantic structures among dialog rounds and (2) identifying several appropriate answers to the given question. To address these challenges, we propose a Sparse Graph Learning (SGL) method to formulate visual dialog as a graph structure learning task. SGL infers inherently sparse dialog structures by incorporating binary and score edges and leveraging a new structural loss function. Next, we introduce a Knowledge Transfer (KT) method that extracts the answer predictions from the teacher model and uses them as pseudo labels. We propose KT to remedy the shortcomings of single ground-truth labels, which severely limit the ability of a model to obtain multiple reasonable answers. As a result, our proposed model significantly improves reasoning capability compared to baseline methods and outperforms the state-of-the-art approaches on the VisDial v1.0 dataset. The source code is available at https://github.com/gicheonkang/SGLKT-VisDial.
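
The abstract names two mechanisms without giving formulas: sparse dialog structures built from binary and score edges, and teacher predictions used as pseudo labels. The following is a minimal illustrative sketch of how such pieces could look, not the authors' implementation. The function names, the thresholded gate, the softmax over kept edges, and the cross-entropy form of the loss are all assumptions made for illustration; the actual method, including its structural loss, is in the linked repository.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_edge_weights(binary_logits, score_logits, threshold=0.0):
    """Hypothetical combination of binary and score edges for one question.

    binary_logits decide whether an edge to each previous dialog round is
    kept (assumed rule: logit > threshold). score_logits are normalized
    with softmax over the kept edges only, so dropped edges get weight 0.
    """
    keep = [b > threshold for b in binary_logits]
    kept_scores = [s for s, k in zip(score_logits, keep) if k]
    if not kept_scores:
        # No edge survives the gate: the question attends to no earlier round.
        return [0.0] * len(keep)
    probs = iter(softmax(kept_scores))
    return [next(probs) if k else 0.0 for k in keep]


def pseudo_label_loss(student_logits, teacher_probs):
    """Cross-entropy of the student against a teacher's answer distribution.

    teacher_probs plays the role of pseudo labels: unlike a single
    ground-truth index, it can give mass to several plausible answers.
    """
    student_probs = softmax(student_logits)
    return -sum(
        t * math.log(p)
        for t, p in zip(teacher_probs, student_probs)
        if t > 0
    )


# Three previous rounds; the second edge is gated out by its binary logit.
weights = sparse_edge_weights([2.0, -1.0, 0.5], [1.0, 3.0, 1.0])

# Three candidate answers; the teacher considers the first two plausible.
loss = pseudo_label_loss([2.0, 1.0, 0.1], [0.6, 0.4, 0.0])
```

In this sketch the binary gate is what produces sparsity: an edge with a high score still receives zero weight if its gate is closed, and the remaining weights always sum to one.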
