Title

VisualCOMET: Reasoning about the Dynamic Context of a Still Image

Authors

Jae Sung Park, Chandra Bhagavatula, Roozbeh Mottaghi, Ali Farhadi, Yejin Choi

Abstract

Even from a single frame of a still image, people can reason about the dynamic story of the image before, after, and beyond the frame. For example, given an image of a man struggling to stay afloat in water, we can reason that the man fell into the water sometime in the past, the intent of that man at the moment is to stay alive, and he will need help in the near future or else he will get washed away. We propose VisualCOMET, a novel framework of visual commonsense reasoning tasks to predict events that might have happened before, events that might happen next, and the intents of the people at present. To support research toward visual commonsense reasoning, we introduce the first large-scale repository of Visual Commonsense Graphs, which consists of over 1.4 million textual descriptions of visual commonsense inferences carefully annotated over a diverse set of 60,000 images, each paired with short video summaries of before and after. In addition, we provide person-grounding (i.e., co-reference links) between people appearing in the image and people mentioned in the textual commonsense descriptions, allowing for tighter integration between images and text. We establish strong baseline performances on this task and demonstrate that integration between visual and textual commonsense reasoning is key and wins over non-integrative alternatives.
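The abstract describes a graph-structured dataset: each image is annotated with an event description, inferences over three relations (before, intent, after), and person tags that ground textual mentions to people in the image. The following minimal sketch illustrates that structure; the field names and example texts are assumptions for illustration, not the repository's actual schema.

```python
# Hypothetical record illustrating the Visual Commonsense Graph structure:
# one image, an observed event, and inferences over three relations.
# Field names are illustrative assumptions, not the dataset's real schema.
record = {
    "image": "frame_001.jpg",
    # Person tags such as "1" are co-reference links grounding textual
    # mentions to specific people detected in the image.
    "event": "1 is struggling to stay afloat in the water",
    "inferences": {
        "before": ["1 fell into the water"],   # events that might have happened
        "intent": ["1 wants to stay alive"],   # intent at present
        "after": ["1 will call out for help"], # events that might happen next
    },
}

# Each (event, relation, inference) triple is one edge of the graph.
edges = [
    (record["event"], relation, text)
    for relation, texts in record["inferences"].items()
    for text in texts
]
print(len(edges))  # 3 edges for this record
```

Viewed this way, the 1.4 million textual descriptions are edges of a large commonsense graph whose nodes are events, intents, and inferences anchored to specific images.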
