Paper title
Spatio-Temporal Action Detection with Multi-Object Interaction
Paper authors
Paper abstract
Spatio-temporal action detection in videos requires localizing the action both spatially and temporally in the form of an "action tube". Most current spatio-temporal action detection datasets (e.g., UCF101-24, AVA, DALY) are annotated with action tubes that contain a single person performing the action, so the predominant action detection models simply employ a person detection and tracking pipeline for localization. However, when the action is defined as an interaction between multiple objects, such methods may fail, since each bounding box in the action tube contains multiple objects instead of one person. In this paper, we study the spatio-temporal action detection problem with multi-object interaction. We introduce a new dataset that is annotated with action tubes containing multi-object interactions. Moreover, we propose an end-to-end spatio-temporal action detection model that performs spatial and temporal regression simultaneously. Our spatial regression may enclose multiple objects participating in the action. At test time, we connect the regressed bounding boxes within the predicted temporal duration using a simple heuristic. We report baseline results of our proposed model on this new dataset and show competitive results on the standard benchmark UCF101-24 using only RGB input.
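The abstract does not say which heuristic links the regressed boxes into a tube. As a rough illustration only, the sketch below assumes a greedy rule: inside the predicted temporal window, start from the highest-scoring box, then at each later frame keep the candidate that maximizes IoU with the previous box plus its own score. The names `iou` and `link_tube`, the input layout, and the scoring rule are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tube(
    frame_boxes: Dict[int, List[Tuple[Box, float]]],
    start: int,
    end: int,
) -> List[Tuple[int, Box]]:
    """Greedily link one box per frame inside [start, end] into a tube.

    Assumed heuristic (not the paper's): the first frame takes the
    highest-scoring box; each later frame takes the candidate with the
    largest (IoU with previous box + confidence score).
    """
    tube: List[Tuple[int, Box]] = []
    prev = None
    for t in range(start, end + 1):
        candidates = frame_boxes.get(t, [])
        if not candidates:
            continue  # frames without regressed boxes are skipped
        if prev is None:
            box, _ = max(candidates, key=lambda c: c[1])
        else:
            box, _ = max(candidates, key=lambda c: iou(prev, c[0]) + c[1])
        tube.append((t, box))
        prev = box
    return tube


# Toy example: two candidate boxes in frames 0 and 1, one in frames 2 and 3.
detections = {
    0: [((0.0, 0.0, 10.0, 10.0), 0.9), ((50.0, 50.0, 60.0, 60.0), 0.5)],
    1: [((1.0, 1.0, 11.0, 11.0), 0.6), ((50.0, 50.0, 60.0, 60.0), 0.7)],
    2: [((2.0, 2.0, 12.0, 12.0), 0.8)],
    3: [((3.0, 3.0, 13.0, 13.0), 0.8)],
}
tube = link_tube(detections, start=0, end=2)
```

In this toy input the predicted temporal window is frames 0 to 2, so frame 3 is ignored, and the spatially consistent box is kept at frame 1 even though a distant box has a higher score. Because each regressed box may already enclose several interacting objects, the linking step only needs one box per frame.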