Paper Title

Learning Variational Motion Prior for Video-based Motion Capture

Paper Authors

Xin Chen, Zhuo Su, Lingbo Yang, Pei Cheng, Lan Xu, Bin Fu, Gang Yu

Paper Abstract

Motion capture from a monocular video is fundamental and crucial for us humans to naturally experience and interact with each other in Virtual Reality (VR) and Augmented Reality (AR). However, existing methods still struggle with challenging cases involving self-occlusion and complex poses due to the lack of effective motion prior modeling. In this paper, we present a novel variational motion prior (VMP) learning approach for video-based motion capture to resolve the above issue. Instead of directly building the correspondence between the video and motion domains, we propose to learn a generic latent space that captures the prior distribution of all natural motions, which serves as the basis for subsequent video-based motion capture tasks. To improve the generalization capacity of the prior space, we propose a transformer-based variational autoencoder pretrained over marker-based 3D mocap data, with a novel style-mapping block to boost the generation quality. Afterward, a separate video encoder is attached to the pretrained motion generator for end-to-end fine-tuning over task-specific video datasets. Compared to existing motion prior models, our VMP model serves as a motion rectifier that can effectively reduce temporal jittering and failure modes in frame-wise pose estimation, leading to temporally stable and visually realistic motion capture results. Furthermore, our VMP-based framework models motion at the sequence level and can directly generate motion clips in a single forward pass, achieving real-time motion capture during inference. Extensive experiments on both public datasets and in-the-wild videos demonstrate the efficacy and generalization capability of our framework.
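
The abstract describes a two-stage design: a transformer-based motion VAE with a style-mapping block, pretrained on marker-based mocap data, and then a separate video encoder attached to the pretrained motion generator for end-to-end fine-tuning. The sketch below is only an illustration of that pipeline under assumed module choices (PyTorch, a GRU video encoder, a token-query transformer decoder, and all dimensions); it is not the authors' implementation.

```python
# Minimal sketch of the two-stage VMP pipeline described in the abstract.
# Module layout, dimensions, and the style-mapping design are assumptions.
import torch
import torch.nn as nn


class StyleMappingBlock(nn.Module):
    """Hypothetical style-mapping block: maps a sampled latent code to a
    style vector before decoding."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, z):
        return self.mlp(z)


class VariationalMotionPrior(nn.Module):
    """Stage 1 (assumed layout): transformer-based VAE over motion sequences,
    pretrained on marker-based 3D mocap clips."""
    def __init__(self, pose_dim=72, latent_dim=256, seq_len=64):
        super().__init__()
        self.embed = nn.Linear(pose_dim, latent_dim)
        enc_layer = nn.TransformerEncoderLayer(latent_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.to_mu = nn.Linear(latent_dim, latent_dim)
        self.to_logvar = nn.Linear(latent_dim, latent_dim)
        self.style = StyleMappingBlock(latent_dim)
        dec_layer = nn.TransformerEncoderLayer(latent_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=4)
        self.queries = nn.Parameter(torch.randn(seq_len, latent_dim))
        self.to_pose = nn.Linear(latent_dim, pose_dim)

    def encode(self, motion):                      # motion: (B, T, pose_dim)
        h = self.encoder(self.embed(motion)).mean(dim=1)
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z):                           # z: (B, latent_dim)
        style = self.style(z)                      # style-mapped latent
        tokens = self.queries.unsqueeze(0) + style.unsqueeze(1)
        return self.to_pose(self.decoder(tokens))  # motion clip: (B, T, pose_dim)

    def forward(self, motion):
        mu, logvar = self.encode(motion)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decode(z), mu, logvar          # train with reconstruction + KL losses


class VideoToMotion(nn.Module):
    """Stage 2 (assumed layout): a separate video encoder maps per-frame image
    features into the prior's latent space; the pretrained decoder then emits
    the whole motion clip in a single forward pass."""
    def __init__(self, prior, feat_dim=2048, latent_dim=256):
        super().__init__()
        self.video_encoder = nn.GRU(feat_dim, latent_dim, batch_first=True)
        self.prior = prior                         # pretrained VMP, fine-tuned end to end

    def forward(self, video_feats):                # video_feats: (B, T, feat_dim)
        _, h = self.video_encoder(video_feats)
        return self.prior.decode(h[-1])            # sequence-level motion clip
```

Because the decoder emits an entire clip from a single latent code, inference is one forward pass per sequence, which is consistent with the abstract's claims of sequence-level modeling and real-time motion capture.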
