Paper Title

MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video

Paper Authors

Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, Junsong Yuan

Paper Abstract

Recent transformer-based solutions have been introduced to estimate 3D human pose from 2D keypoint sequences by considering body joints across all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, previous methods cannot efficiently model the solid inter-frame correspondence of each joint, leading to insufficient learning of spatio-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder), which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. These two blocks are utilized alternately to obtain better spatio-temporal feature encoding. In addition, the network output is extended from the central frame to all frames of the input video, thereby improving the coherence between the input and output sequences. Extensive experiments are conducted on three benchmarks (Human3.6M, MPI-INF-3DHP, and HumanEva). The results show that our model outperforms the state-of-the-art approach by 10.9% in P-MPJPE and 7.6% in MPJPE. The code is available at https://github.com/JinluZhang1126/MixSTE.
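
To make the alternating design concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a spatial transformer block that attends over the joints within each frame, alternated with a temporal transformer block that attends over the frames of each individual joint, followed by a seq2seq head that regresses 3D poses for every input frame. This is an illustration only, not the authors' implementation: the class names, dimensions, and the use of `nn.TransformerEncoderLayer` are assumptions, and positional embeddings and training details are omitted; see the linked repository for the official code.

```python
import torch
import torch.nn as nn


class MixedSTEBlock(nn.Module):
    """One spatial attention pass alternated with one temporal pass."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Spatial block: joints within the same frame attend to each other.
        self.spatial = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # Temporal block: each joint attends to itself across frames.
        self.temporal = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, dim)
        b, f, j, c = x.shape
        # Fold frames into the batch and attend over the joint axis.
        x = self.spatial(x.reshape(b * f, j, c)).reshape(b, f, j, c)
        # Fold joints into the batch and attend over the frame axis,
        # so every joint gets its own temporal motion model.
        x = x.permute(0, 2, 1, 3).reshape(b * j, f, c)
        x = self.temporal(x).reshape(b, j, f, c).permute(0, 2, 1, 3)
        return x


class MixSTESketch(nn.Module):
    """Seq2seq lifting: 2D keypoints in, 3D poses out for every frame."""

    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.embed = nn.Linear(2, dim)   # lift (x, y) coordinates to features
        # Positional embeddings are omitted here for brevity.
        self.blocks = nn.ModuleList(
            [MixedSTEBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, 3)    # regress (x, y, z) per joint

    def forward(self, keypoints_2d: torch.Tensor) -> torch.Tensor:
        # keypoints_2d: (batch, frames, joints, 2)
        x = self.embed(keypoints_2d)
        for block in self.blocks:
            x = block(x)
        # Output covers all frames, not just the central one.
        return self.head(x)              # (batch, frames, joints, 3)


# Usage: a batch of two 27-frame clips with 17 joints each.
poses_3d = MixSTESketch()(torch.randn(2, 27, 17, 2))
print(poses_3d.shape)  # torch.Size([2, 27, 17, 3])
```

Folding one axis into the batch dimension before each attention pass is what keeps the two correlations separate: the spatial pass never mixes information across frames, and the temporal pass never mixes information across joints, which matches the abstract's claim of modeling each joint's temporal motion individually.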
