Paper Title

Visual Question Answering on 360° Images

Authors

Shih-Han Chou, Wei-Lun Chao, Wei-Sheng Lai, Min Sun, Ming-Hsuan Yang

Abstract

In this work, we introduce VQA 360°, a novel task of visual question answering on 360° images. Unlike a normal field-of-view image, a 360° image captures the entire visual content around the optical center of the camera, demanding more sophisticated spatial understanding and reasoning. To address this problem, we collect the first VQA 360° dataset, containing around 17,000 real-world image-question-answer triplets for a variety of question types. We then study two different VQA models on VQA 360°: one conventional model that takes an equirectangular image (with intrinsic distortion) as input, and one dedicated model that first projects the 360° image onto cubemaps and subsequently aggregates information from multiple spatial resolutions. We demonstrate that the cubemap-based model with multi-level fusion and attention diffusion performs favorably against other variants and the equirectangular-based models. Nevertheless, the gap between human and machine performance reveals the need for more advanced VQA 360° algorithms. We therefore expect our dataset and studies to serve as a benchmark for future development on this challenging task. The dataset, code, and pre-trained models are available online.
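The abstract mentions projecting the 360° (equirectangular) image onto cubemap faces before aggregating features. A minimal sketch of that projection step, assuming a NumPy `H × W × C` equirectangular array; the function name, face layout, and axis conventions here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def equirect_to_cube_face(equi, face, size):
    """Sample one cubemap face from an equirectangular image via nearest-neighbor lookup.

    equi: H x W x C array; face: 'front', 'back', 'right', 'left', 'up', or 'down'.
    Axis conventions are one common choice; real pipelines often differ.
    """
    # Pixel grid on the unit cube face, coordinates in [-1, 1].
    lin = np.linspace(-1.0, 1.0, size)
    u, v = np.meshgrid(lin, lin)
    ones = np.ones_like(u)
    # 3D ray direction for each pixel, per face of a unit cube (assumed layout).
    dirs = {
        'front': ( u,    -v,    ones),
        'back':  (-u,    -v,   -ones),
        'right': ( ones, -v,   -u),
        'left':  (-ones, -v,    u),
        'up':    ( u,     ones, v),
        'down':  ( u,    -ones, -v),
    }
    x, y, z = dirs[face]
    # Rays to spherical coordinates: longitude in [-pi, pi], latitude in [-pi/2, pi/2].
    lon = np.arctan2(x, z)
    lat = np.arctan2(y, np.sqrt(x ** 2 + z ** 2))
    # Spherical coordinates to equirectangular pixel indices (nearest neighbor).
    H, W = equi.shape[:2]
    px = np.round((lon / np.pi + 1.0) / 2.0 * (W - 1)).astype(int)
    py = np.round((0.5 - lat / np.pi) * (H - 1)).astype(int)
    return equi[py.clip(0, H - 1), px.clip(0, W - 1)]
```

The center of the hypothetical `front` face looks straight along the optical axis, so it samples the center of the equirectangular image; a cubemap-based VQA model would then run its visual encoder on each of the six faces.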
