Title

Synthetic Training for Monocular Human Mesh Recovery

Authors

Yu Sun, Qian Bao, Wu Liu, Wenpeng Gao, Yili Fu, Chuang Gan, Tao Mei

Abstract


Recovering 3D human mesh from monocular images is a popular topic in computer vision and has a wide range of applications. This paper aims to estimate the 3D mesh of multiple body parts (e.g., body, hands) with large scale differences from a single RGB image. Existing methods are mostly based on iterative optimization, which is very time-consuming. We propose to train a single-shot model to achieve this goal. The main challenge is the lack of training data with complete 3D annotations of all body parts in 2D images. To solve this problem, we design a multi-branch framework to disentangle the regression of different body properties, enabling us to separate each component's training in a synthetic training manner using available unpaired data. In addition, to strengthen generalization ability, most existing methods use in-the-wild 2D pose datasets to supervise the estimated 3D pose via 3D-to-2D projection. However, we observe that the commonly used weak-perspective model performs poorly in dealing with the external foreshortening effect of camera projection. Therefore, we propose a depth-to-scale (D2S) projection that incorporates the depth difference into the projection function to derive per-joint scale variants for more proper supervision. According to the evaluation results, the proposed method outperforms previous methods on the CMU Panoptic Studio dataset and achieves comparable results on the Human3.6M body and STB hand benchmarks. More impressively, performance on close-shot images is significantly improved by using the proposed D2S projection for weak supervision, while the method maintains a clear advantage in computational efficiency.
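To illustrate the projection issue the abstract describes, the sketch below contrasts a standard weak-perspective projection (a single shared scale for every joint) with a hypothetical depth-to-scale variant that modulates each joint's scale by its depth offset from the root joint. The function names and the exact per-joint scale formula are assumptions for illustration, not the paper's actual D2S definition.

```python
import numpy as np

def weak_perspective(joints3d, s, t):
    """Weak-perspective projection: every joint shares one scale s.

    joints3d: (N, 3) array of 3D joint positions; t: (2,) translation.
    Ignores per-joint depth, so foreshortening is not modeled.
    """
    return s * joints3d[:, :2] + t

def d2s_projection(joints3d, s, t, depth_root):
    """Hypothetical depth-to-scale sketch (NOT the paper's exact formula).

    Each joint's scale shrinks as the joint moves farther from the
    camera than the root joint, approximating perspective
    foreshortening that weak perspective misses.
    """
    dz = joints3d[:, 2] - joints3d[0, 2]          # depth offset vs. root
    s_j = s * depth_root / (depth_root + dz)      # per-joint scale variant
    return s_j[:, None] * joints3d[:, :2] + t
```

With zero depth offsets the two projections coincide; a joint that lies deeper than the root is pulled toward the image center, which is the foreshortening behavior the D2S supervision targets.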
