Paper Title

TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts

Paper Authors

Chuan Guo, Xinxin Zuo, Sen Wang, Li Cheng

Paper Abstract

Inspired by the strong ties between vision and language, the two intimate human sensing and communication modalities, our paper aims to explore the generation of 3D human full-body motions from texts, as well as its reciprocal task, shorthanded as text2motion and motion2text, respectively. To tackle the existing challenges, especially to enable the generation of multiple distinct motions from the same text and to avoid the undesirable production of trivial motionless pose sequences, we propose the use of motion tokens, a discrete and compact motion representation. This provides a level playing ground for the two modalities, with motion and text signals treated as motion and text tokens, respectively. Moreover, our motion2text module is integrated into the inverse alignment process of our text2motion training pipeline, where a significant deviation of the synthesized text from the input text is penalized by a large training loss; empirically, this is shown to effectively improve performance. Finally, the mappings between the two modalities of motions and texts are facilitated by adapting neural machine translation (NMT) models to our context. The autoregressive modeling of the distribution over discrete motion tokens further enables non-deterministic production of pose sequences of variable lengths from an input text. Our approach is flexible and can be used for both the text2motion and motion2text tasks. Empirical evaluations on two benchmark datasets demonstrate the superior performance of our approach over a variety of state-of-the-art methods on both tasks. Project page: https://ericguo5513.github.io/TM2T/
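The abstract does not spell out how motion tokens are obtained or decoded. A common realization of a "discrete and compact motion representation" is a learned vector-quantization codebook, paired with an autoregressive model that samples token sequences. The sketch below illustrates both ideas under that assumption; the names `MotionTokenizer` and `sample_motion_tokens`, the `next_token_logits` interface, and the sizes (`feat_dim=263`, `codebook_size=1024`) are all illustrative placeholders, not the paper's actual code.

```python
import torch
import torch.nn as nn

class MotionTokenizer(nn.Module):
    """Toy vector-quantization codebook: maps continuous per-frame motion
    features to the index of the nearest codebook entry (a "motion token")."""

    def __init__(self, feat_dim=263, codebook_size=1024):  # sizes are assumptions
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, feat_dim)

    @torch.no_grad()
    def encode(self, motion_feats):
        # motion_feats: (T, feat_dim) features from some motion encoder.
        dists = torch.cdist(motion_feats, self.codebook.weight)  # (T, codebook_size)
        return dists.argmin(dim=-1)                              # (T,) discrete tokens

def sample_motion_tokens(next_token_logits, text_tokens, max_len=50, end_id=0):
    """Non-deterministic, variable-length decoding: sample each next motion
    token from the predicted distribution instead of taking the argmax, so
    the same text can yield multiple distinct motions."""
    generated = []
    for _ in range(max_len):
        prefix = torch.tensor(generated, dtype=torch.long)
        # next_token_logits is an assumed callable: (text tokens, motion-token
        # prefix) -> (codebook_size,) logits for the next motion token.
        logits = next_token_logits(text_tokens, prefix)
        next_id = torch.multinomial(logits.softmax(-1), 1).item()
        if next_id == end_id:  # sampled end token -> variable sequence length
            break
        generated.append(next_id)
    return generated
```

Sampling from the full distribution (rather than greedy argmax decoding) is what makes the generation stochastic: repeated calls with the same text tokens can return different motion-token sequences of different lengths, matching the one-to-many nature of the text2motion task described above.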
