Title
LocATe: End-to-end Localization of Actions in 3D with Transformers
Authors
Abstract
Understanding a person's behavior from their 3D motion is a fundamental problem in computer vision with many applications. An important component of this problem is 3D Temporal Action Localization (3D-TAL), which involves recognizing what actions a person is performing, and when. State-of-the-art 3D-TAL methods employ a two-stage approach in which the action span detection task and the action recognition task are implemented as a cascade. This approach, however, limits the possibility of error-correction. In contrast, we propose LocATe, an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence. Further, unlike existing autoregressive models that focus on modeling the local context in a sequence, LocATe's transformer model is capable of capturing long-term correlations between actions in a sequence. Unlike transformer-based object-detection and classification models which consider image or patch features as input, the input in 3D-TAL is a long sequence of highly correlated frames. To handle the high-dimensional input, we implement an effective input representation, and overcome the diffuse attention across long time horizons by introducing sparse attention in the model. LocATe outperforms previous approaches on the existing PKU-MMD 3D-TAL benchmark (mAP=93.2%). Finally, we argue that benchmark datasets are most useful where there is clear room for performance improvement. To that end, we introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse. The dataset and code for the method will be available for research purposes.
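The abstract's idea of sparsifying attention so that it does not become diffuse over long frame sequences can be illustrated with a banded (local-window) attention mask. The sketch below is a minimal NumPy illustration of that general technique, not the authors' implementation; the window size, feature dimensionality, and function name are assumptions for the example.

```python
import numpy as np

def local_sparse_attention(x, window=16):
    """Scaled dot-product self-attention restricted to a local window.

    Illustrative sketch only: each frame attends to frames at most
    `window` steps away, instead of the full T x T attention pattern.
    x: (T, d) array of per-frame features.
    """
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                    # (T, T) attention logits
    # Band mask: frame i may attend to frame j only if |i - j| <= window.
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)         # disallow distant pairs
    # Softmax over the allowed positions only.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                               # (T, d) attended features

# Example: 120 frames of 8-D features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((120, 8))
out = local_sparse_attention(feats, window=16)
```

With a band of width `window`, each row of the attention matrix has at most `2 * window + 1` nonzero entries, so the effective cost grows linearly in sequence length rather than quadratically, which is the motivation for sparse attention over long, highly correlated frame sequences.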