论文标题

PACS:用于物理视听常识性推理的数据集

PACS: A Dataset for Physical Audiovisual CommonSense Reasoning

论文作者

Yu, Samuel, Wu, Peter, Liang, Paul Pu, Salakhutdinov, Ruslan, Morency, Louis-Philippe

论文摘要

为了使AI安全部署在医院,学校和工作场所等现实世界中,它必须能够强烈理解物理世界。推理的基础是物理常识:了解可用对象的物理特性和提供的能力,如何被操纵以及它们如何与其他对象相互作用。物理常识性推理从根本上是一项多感觉任务,因为物理特性是通过多种模式表现出来的,其中两个是视觉和声学。我们的论文通过贡献PACS迈出了一步,朝着现实世界中的物理常识推理:第一个用于物理常识属性注释的视听基准。 PACS包含13,400对提问,涉及1,377个独特的物理常识性问题和1,526个视频。我们的数据集提供了新的机会来通过将音频作为此多模式问题的核心组成部分来推进物理推理的研究领域。使用PACS,我们在我们的新挑战性任务上评估了多种最先进的模型。尽管某些模型显示出令人鼓舞的结果(精度为70%),但它们都没有人类绩效(精度为95%)。我们通过证明多模式推理的重要性并为未来的研究提供了可能的途径来结束本文。

In order for AI to be safely deployed in real-world scenarios such as hospitals, schools, and the workplace, it must be able to robustly reason about the physical world. Fundamental to this reasoning is physical common sense: understanding the physical properties and affordances of available objects, how they can be manipulated, and how they interact with other objects. Physical commonsense reasoning is fundamentally a multi-sensory task, since physical properties are manifested through multiple modalities - two of them being vision and acoustics. Our paper takes a step towards real-world physical commonsense reasoning by contributing PACS: the first audiovisual benchmark annotated for physical commonsense attributes. PACS contains 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos. Our dataset provides new opportunities to advance the research field of physical reasoning by bringing audio as a core component of this multimodal problem. Using PACS, we evaluate multiple state-of-the-art models on our new challenging task. While some models show promising results (70% accuracy), they all fall short of human performance (95% accuracy). We conclude the paper by demonstrating the importance of multimodal reasoning and providing possible avenues for future research.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源