Graph-Structured Referring Expression Reasoning in The Wild

Paper Title

Graph-Structured Referring Expression Reasoning in The Wild

Authors

Yang, Sibei; Li, Guanbin; Yu, Yizhou

Abstract

Grounding referring expressions aims to locate in an image an object referred to by a natural language expression. The linguistic structure of a referring expression provides a layout of reasoning over the visual contents, and it is often crucial to align and jointly understand the image and the referring expression. In this paper, we propose a scene graph guided modular network (SGMN), which performs reasoning over a semantic graph and a scene graph with neural modules under the guidance of the linguistic structure of the expression. In particular, we model the image as a structured semantic graph, and parse the expression into a language scene graph. The language scene graph not only decodes the linguistic structure of the expression, but also has a consistent representation with the image semantic graph. In addition to exploring structured solutions to grounding referring expressions, we also propose Ref-Reasoning, a large-scale real-world dataset for structured referring expression reasoning. We automatically generate referring expressions over the scene graphs of images using diverse expression templates and functional programs. This dataset is equipped with real-world visual contents as well as semantically rich expressions with different reasoning layouts. Experimental results show that our SGMN not only significantly outperforms existing state-of-the-art algorithms on the new Ref-Reasoning dataset, but also surpasses state-of-the-art structured methods on commonly used benchmark datasets. It can also provide interpretable visual evidence of reasoning. Data and code are available at https://github.com/sibeiyang/sgmn
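To make the idea of reasoning over two consistent graphs concrete, here is a minimal toy sketch in Python. It is not the SGMN implementation: exact label matching stands in for the learned neural modules, and the SceneGraph class, the ground function, and the example objects and relations are all hypothetical names introduced only for illustration. The image and the expression are both represented as small graphs of labeled nodes and (subject, relation, object) edges, and the referent is found by following the expression graph's edges.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy graph: node id -> label, plus (subject, relation, object) edges."""
    nodes: dict[int, str]
    edges: list[tuple[int, str, int]] = field(default_factory=list)


def ground(image: SceneGraph, expr: SceneGraph, referent: int) -> list[int]:
    """Return the image node ids that can match the expression's referent.

    Assumption: exact string matching replaces learned similarity scores,
    so this only illustrates the graph-guided reasoning layout.
    """
    def candidates(expr_node, visiting):
        # Start from every image node whose label equals the expression node's.
        label = expr.nodes[expr_node]
        result = [n for n, lab in image.nodes.items() if lab == label]
        # Each outgoing expression edge filters the candidates further.
        for subj, rel, obj in expr.edges:
            if subj != expr_node or obj in visiting:
                continue  # skip unrelated edges and guard against cycles
            obj_matches = set(candidates(obj, visiting | {obj}))
            result = [
                n for n in result
                if any(s == n and r == rel and o in obj_matches
                       for s, r, o in image.edges)
            ]
        return result

    return candidates(referent, frozenset({referent}))


# Hypothetical image: two cups, one on a table and one on a chair.
image = SceneGraph(
    nodes={0: "cup", 1: "cup", 2: "table", 3: "chair"},
    edges=[(0, "on", 2), (1, "on", 3)],
)
# Hypothetical parse of "the cup on the table"; node 0 is the referent.
expr = SceneGraph(nodes={0: "cup", 1: "table"}, edges=[(0, "on", 1)])

print(ground(image, expr, referent=0))
```

In this sketch the expression graph decides which relations are checked and in what order, which mirrors, in a very simplified symbolic form, the role the abstract assigns to the linguistic structure of the expression.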
