Paper Title

Kinematic-aware Hierarchical Attention Network for Human Pose Estimation in Videos

Paper Authors

Kyung-Min Jin, Byoung-Sung Lim, Gun-Hee Lee, Tae-Kyung Kang, Seong-Whan Lee

Paper Abstract

Previous video-based human pose estimation methods have shown promising results by leveraging aggregated features of consecutive frames. However, most approaches compromise accuracy to mitigate jitter or do not sufficiently comprehend the temporal aspects of human motion. Furthermore, occlusion increases uncertainty between consecutive frames, which leads to unsmooth results. To address these issues, we design an architecture that exploits keypoint kinematic features with the following components. First, we effectively capture temporal features by leveraging each keypoint's velocity and acceleration. Second, the proposed hierarchical transformer encoder aggregates spatio-temporal dependencies and refines the 2D or 3D input pose estimated from existing estimators. Finally, we provide an online cross-supervision between the refined input pose generated from the encoder and the final pose from our decoder to enable joint optimization. We demonstrate comprehensive results and validate the effectiveness of our model in various tasks: 2D pose estimation, 3D pose estimation, body mesh recovery, and sparsely annotated multi-human pose estimation. Our code is available at https://github.com/KyungMinJin/HANet.
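To make the kinematic-feature idea concrete, here is a minimal sketch of deriving per-keypoint velocity and acceleration with finite differences, assuming a pose sequence arrives as a (T, J, C) tensor (T frames, J keypoints, C coordinate dimensions). The function name and the edge-padding scheme are illustrative assumptions, not the authors' implementation:

```python
import torch

def kinematic_features(poses: torch.Tensor) -> torch.Tensor:
    """Stack position, velocity, and acceleration along the channel dim.

    Illustrative sketch only; `poses` has shape (T, J, C).
    """
    # First-order temporal difference approximates per-keypoint velocity.
    velocity = poses[1:] - poses[:-1]                      # (T-1, J, C)
    velocity = torch.cat([velocity[:1], velocity], dim=0)  # pad back to (T, J, C)
    # Second-order difference approximates per-keypoint acceleration.
    acceleration = velocity[1:] - velocity[:-1]
    acceleration = torch.cat([acceleration[:1], acceleration], dim=0)
    # Concatenate position, velocity, acceleration as kinematic features.
    return torch.cat([poses, velocity, acceleration], dim=-1)  # (T, J, 3C)

# Example: 16 frames, 17 COCO keypoints, 2D coordinates -> (16, 17, 6).
features = kinematic_features(torch.randn(16, 17, 2))
```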
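The hierarchical encoder can be pictured as attention over joints within each frame followed by attention over frames for each joint. The following is a rough sketch under assumed shapes and layer choices; the paper's actual hierarchy, dimensions, and positional encodings may differ:

```python
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    """Illustrative two-level encoder: spatial attention, then temporal."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, J, D) - batch, frames, joints, feature dim.
        B, T, J, D = x.shape
        x = self.spatial(x.reshape(B * T, J, D))      # joints attend within a frame
        x = x.reshape(B, T, J, D).transpose(1, 2)     # -> (B, J, T, D)
        x = self.temporal(x.reshape(B * J, T, D))     # frames attend per joint
        return x.reshape(B, J, T, D).transpose(1, 2)  # restore (B, T, J, D)

# Example usage with kinematic features already projected to dim=64.
out = SpatioTemporalEncoder()(torch.randn(2, 16, 17, 64))
```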
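Finally, the online cross-supervision can be sketched as a loss that ties the encoder's refined pose and the decoder's final pose both to the ground truth and to each other. The L1 distance and the weight `w_cross` below are assumptions for illustration, not values from the paper:

```python
import torch
import torch.nn.functional as F

def cross_supervision_loss(refined: torch.Tensor,
                           final: torch.Tensor,
                           gt: torch.Tensor,
                           w_cross: float = 0.5) -> torch.Tensor:
    # Both branches are anchored to the ground-truth pose.
    loss_refined = F.l1_loss(refined, gt)
    loss_final = F.l1_loss(final, gt)
    # Each branch is also pulled toward a detached copy of the other, so the
    # two predictions supervise each other and are optimized jointly.
    loss_cross = (F.l1_loss(final, refined.detach())
                  + F.l1_loss(refined, final.detach()))
    return loss_refined + loss_final + w_cross * loss_cross
```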
