Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for 3D Human Pose Estimation

Paper title

Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for 3D Human Pose Estimation

Authors

Sebastian Lutz, Richard Blythman, Koustav Ghosal, Matthew Moynihan, Ciaran Simms, Aljosa Smolic

Abstract

Monocular 3D human pose estimation technologies have the potential to greatly increase the availability of human movement data. The best-performing models for single-image 2D-3D lifting use graph convolutional networks (GCNs) that typically require some manual input to define the relationships between different body joints. We propose a novel transformer-based approach that uses the more generalised self-attention mechanism to learn these relationships within a sequence of tokens representing joints. We find that the use of intermediate supervision, as well as residual connections between the stacked encoders, benefits performance. We also suggest that using error prediction as part of a multi-task learning framework improves performance by allowing the network to compensate for its confidence level. We perform extensive ablation studies to show that each of our contributions increases performance. Furthermore, we show that our approach outperforms the recent state of the art for single-frame 3D human pose estimation by a large margin. Our code and trained models are made publicly available on GitHub.
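
The abstract's central idea, treating each body joint as a token and letting self-attention learn the relationships between joints instead of a hand-defined skeleton graph, can be illustrated with a minimal sketch. This is not the authors' implementation: the sizes (17 joints, 2D inputs, an 8-dimensional embedding), the random weights, and all function names below are assumptions chosen only for illustration, and the sketch covers a single attention head without the intermediate supervision or error-prediction parts.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over joint tokens.

    tokens is a J x d list of rows, one embedding per joint.
    Returns (outputs, weights), where weights[i][j] is how strongly
    joint i attends to joint j.
    """
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned parameters or detected keypoints."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
num_joints, d_in, d_model = 17, 2, 8  # assumed sizes, not taken from the paper

pose_2d = random_matrix(num_joints, d_in, rng)   # placeholder 2D keypoints (x, y)
w_embed = random_matrix(d_in, d_model, rng)      # placeholder joint embedding
tokens = matmul(pose_2d, w_embed)                # one token per joint

w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))
outputs, weights = self_attention(tokens, w_q, w_k, w_v)
```

Every row of `weights` spans all 17 joints, so no adjacency matrix has to be supplied by hand; in a trained model these weights would be learned from data rather than drawn at random.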
