Title
Spatiotemporal Relationship Reasoning for Pedestrian Intent Prediction
Authors
Abstract
Reasoning over visual data is a desirable capability for robotics and vision-based applications. Such reasoning enables forecasting of the next events or actions in videos. In recent years, various convolution-based models have been developed for prediction and forecasting, but they lack the ability to reason over spatiotemporal data and to infer the relationships among different objects in the scene. In this paper, we present a framework based on graph convolution to uncover the spatiotemporal relationships in the scene for reasoning about pedestrian intent. A scene graph is built on top of segmented object instances within and across video frames. Pedestrian intent, defined as the future action of crossing or not crossing the street, is a crucial piece of information for autonomous vehicles to navigate safely and more smoothly. We approach intent prediction from two different perspectives, anticipating the intention-to-cross in both pedestrian-centric and location-centric scenarios. In addition, we introduce a new dataset designed specifically for autonomous-driving scenarios in areas with dense pedestrian populations: the Stanford-TRI Intent Prediction (STIP) dataset. Our experiments on STIP and another benchmark dataset show that our graph modeling framework predicts pedestrians' intention-to-cross with an accuracy of 79.10% on STIP and 79.28% on the Joint Attention in Autonomous Driving (JAAD) dataset, up to one second before the actual crossing happens. These results outperform the baselines and previous work. Please refer to http://stip.stanford.edu/ for the dataset and code.
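The core operation described above, graph convolution over a scene graph of object instances, can be sketched as follows. This is a minimal illustration under assumed Kipf-and-Welling-style symmetric normalization, not the authors' implementation; the node count, feature dimensions, and the toy pedestrian/car/crosswalk graph are made up for demonstration.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution layer: each node aggregates features
    from its neighbors (and itself) and applies a learned projection.

    A: (n, n) adjacency matrix of the scene graph (objects as nodes)
    X: (n, d) node feature matrix
    W: (d, k) learned weight matrix
    Returns an (n, k) matrix of updated node features.
    """
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt       # symmetric normalization
    return np.maximum(A_norm @ X @ W, 0.0)         # linear map + ReLU

# Toy scene graph: 3 objects (pedestrian, car, crosswalk), chain-connected.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.random.randn(3, 4)   # hypothetical 4-dim features per object
W = np.random.randn(4, 2)   # project to 2-dim relational features
H = gcn_layer(A, X, W)
print(H.shape)  # (3, 2)
```

Stacking such layers lets information propagate over multi-hop relations in the scene graph; a temporal model (e.g., a recurrent unit over per-frame graph features) would then consume these node embeddings to predict intention-to-cross.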