Paper Title
VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation
Paper Authors
Abstract
This paper presents the Volumetric Transformer Pose estimator (VTP), the first 3D volumetric transformer framework for multi-view multi-person 3D human pose estimation. VTP aggregates features from 2D keypoints across all camera views and directly learns spatial relationships in the 3D voxel space in an end-to-end fashion. The aggregated 3D features are passed through 3D convolutions before being flattened into sequential embeddings and fed into a transformer; a residual structure further improves performance. In addition, sparse Sinkhorn attention is employed to reduce the memory cost, a major bottleneck for volumetric representations, while still achieving excellent performance. The output of the transformer is then concatenated with the 3D convolutional features through a residual design. The proposed VTP framework combines the high performance of transformers with volumetric representations and can serve as a strong alternative to convolutional backbones. Experiments on the Shelf, Campus and CMU Panoptic benchmarks show promising results in terms of both Mean Per Joint Position Error (MPJPE) and Percentage of Correctly estimated Parts (PCP). Our code will be made available.
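To make the pipeline described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the aggregate-convolve-flatten-attend-fuse flow. Every module name, tensor shape, and hyperparameter here is an illustrative assumption rather than the authors' implementation; in particular, the dense transformer attention stands in for the paper's sparse Sinkhorn attention, and the unprojection of per-view 2D keypoint features into a shared voxel grid is assumed to be given.

```python
import torch
import torch.nn as nn

class ToyVTP(nn.Module):
    """Illustrative sketch of a volumetric-transformer pose backbone.

    Not the paper's implementation: dense attention replaces sparse
    Sinkhorn attention, and all shapes are toy-sized.
    """

    def __init__(self, feat_dim=32, depth=2, heads=4):
        super().__init__()
        # 3D convolutional stem over the aggregated voxel features.
        self.conv3d = nn.Sequential(
            nn.Conv3d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Standard dense self-attention over voxel tokens; the paper uses
        # sparse Sinkhorn attention to keep this memory cost manageable.
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # Fuses transformer output with the conv features (the abstract's
        # "residual design" via concatenation).
        self.fuse = nn.Conv3d(2 * feat_dim, feat_dim, kernel_size=1)

    def forward(self, view_feats):
        # view_feats: (batch, views, channels, D, H, W) -- per-view 2D
        # keypoint features already unprojected into a shared voxel grid
        # (the unprojection step is assumed to be done upstream).
        vol = view_feats.mean(dim=1)               # aggregate camera views
        conv = self.conv3d(vol)                    # (B, C, D, H, W)
        B, C, D, H, W = conv.shape
        tokens = conv.flatten(2).transpose(1, 2)   # (B, D*H*W, C) sequence
        out = self.transformer(tokens)             # global spatial attention
        out = out.transpose(1, 2).view(B, C, D, H, W)
        return self.fuse(torch.cat([conv, out], dim=1))

# Usage: 2 camera views of an 8^3 voxel grid with 32-channel features.
x = torch.randn(1, 2, 32, 8, 8, 8)
print(ToyVTP()(x).shape)  # torch.Size([1, 32, 8, 8, 8])
```

Note that the token count grows cubically with voxel resolution (an 8^3 grid already yields 512 tokens), so dense attention scales poorly; this is precisely the memory bottleneck that motivates the paper's use of sparse Sinkhorn attention.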