Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation

Paper Title

Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation

Authors

Kundu, Jogendra Nath; Seth, Siddharth; M V, Rahul; Rakesh, Mugalodi; Babu, R. Venkatesh; Chakraborty, Anirban

Abstract

Estimation of 3D human pose from a monocular image has gained considerable attention as a key step in several human-centric applications. However, the generalizability of human pose estimation models developed using supervision on large-scale in-studio datasets remains questionable, as these models often perform unsatisfactorily in unseen in-the-wild environments. Though weakly-supervised models have been proposed to address this shortcoming, the performance of such models relies on the availability of paired supervision on some related task, such as 2D pose or multi-view image pairs. In contrast, we propose a novel kinematic-structure-preserved unsupervised 3D pose estimation framework, which is not restrained by any paired or unpaired weak supervision. Our pose estimation framework relies on a minimal set of prior knowledge that defines the underlying kinematic 3D structure, such as skeletal joint connectivity information with bone-length ratios in a fixed canonical scale. The proposed model employs three consecutive differentiable transformations, named forward-kinematics, camera-projection, and spatial-map transformation. This design not only acts as a suitable bottleneck stimulating effective pose disentanglement but also yields interpretable latent pose representations, avoiding the need to train an explicit latent-embedding-to-pose mapper. Furthermore, devoid of an unstable adversarial setup, we re-utilize the decoder to formalize an energy-based loss, which enables us to learn from in-the-wild videos, beyond laboratory settings. Comprehensive experiments demonstrate our state-of-the-art unsupervised and weakly-supervised pose estimation performance on both the Human3.6M and MPI-INF-3DHP datasets. Qualitative results on unseen environments further establish our superior generalization ability.
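
The three transformations named in the abstract can be illustrated with a small NumPy sketch. This is only a rough illustration under stated assumptions, not the authors' implementation: the abstract does not give the parameterization, so the toy four-joint chain, the bone lengths, the weak-perspective camera, the Gaussian rendering of the spatial maps, and all function names below are hypothetical.

```python
import numpy as np

# Hypothetical toy skeleton: joint 0 is the root and PARENTS[j] is the parent of joint j.
# Parents are listed before their children so one forward pass suffices.
PARENTS = [-1, 0, 1, 2]
# Bone lengths in a fixed canonical scale (illustrative values, not from the paper).
BONE_LENGTHS = np.array([0.0, 1.0, 0.5, 0.5])


def forward_kinematics(directions):
    """Turn per-bone direction vectors into 3D joint positions (root at the origin)."""
    joints = np.zeros((len(PARENTS), 3))
    for j in range(1, len(PARENTS)):
        unit = directions[j] / np.linalg.norm(directions[j])
        # The bone length is fixed by the prior, so only the direction is free.
        joints[j] = joints[PARENTS[j]] + BONE_LENGTHS[j] * unit
    return joints


def project(joints_3d, rotation, scale, translation):
    """Assumed weak-perspective camera: rotate, drop depth, then scale and shift."""
    rotated = joints_3d @ rotation.T
    return scale * rotated[:, :2] + translation


def spatial_maps(joints_2d, size, sigma=1.0):
    """Render one Gaussian heatmap per joint on a size x size pixel grid."""
    ys, xs = np.mgrid[0:size, 0:size]
    maps = [
        np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        for x, y in joints_2d
    ]
    return np.stack(maps)


# Direction vectors need not be unit length; row 0 (the root) is ignored.
directions = np.array(
    [[0.0, 0.0, 1.0], [0.0, 2.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
)
joints = forward_kinematics(directions)
pose_2d = project(joints, np.eye(3), scale=4.0, translation=np.array([8.0, 8.0]))
maps = spatial_maps(pose_2d, size=16)
```

Every step here is a composition of differentiable operations, and the bone lengths are fixed rather than predicted, so any pose produced by this chain respects the skeletal prior by construction. That is the sense in which such a pipeline can act as a structural bottleneck.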
