Paper Title
Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses
Paper Authors
Paper Abstract
In this paper, we propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person, where the output video has synchronized, realistic, and expressive body dynamics. We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN), and then synthesizing the output video via a conditional generative adversarial network (GAN). To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process, in both the learning and testing pipelines. The former prevents the generation of unreasonable body distortions, while the latter helps our model quickly learn meaningful body movements from a few recorded videos. To produce photo-realistic, high-resolution video with motion details, we propose inserting a part attention mechanism into the conditional GAN, where each detailed part, e.g., the head and hands, is automatically zoomed in to have its own discriminator. To validate our approach, we collect a dataset of 20 high-quality videos of one male and one female model reading various documents on different topics. Compared with previous SoTA pipelines handling similar tasks, our approach achieves better results in a user study.
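The abstract describes a two-stage pipeline: an RNN that regresses 3D skeleton motion from audio features, followed by a conditional GAN whose part attention mechanism gives zoomed-in regions such as the head and hands their own discriminators. Below is a minimal PyTorch sketch of these two components; it is an illustration, not the paper's implementation, and the module names (`AudioToSkeleton`, `PartDiscriminator`), layer sizes, and joint count are assumptions.

```python
import torch
import torch.nn as nn

class AudioToSkeleton(nn.Module):
    """Stage 1 sketch: map a sequence of audio features to 3D joint positions."""
    def __init__(self, audio_dim=128, hidden_dim=256, num_joints=54):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        # Regress (x, y, z) for every joint of the articulated skeleton;
        # the 54-joint count here is an assumption, not the paper's value.
        self.head = nn.Linear(hidden_dim, num_joints * 3)

    def forward(self, audio_feats):            # (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)           # (B, T, hidden_dim)
        return self.head(h)                    # (B, T, num_joints * 3)

class PartDiscriminator(nn.Module):
    """Stage 2 sketch: a PatchGAN-style discriminator for one zoomed-in part
    (e.g. head or hand), standing in for the paper's per-part discriminators."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, padding=1),   # patch-level real/fake scores
        )

    def forward(self, part_crop):              # (B, 3, H, W) crop of one part
        return self.net(part_crop)

# Toy usage: 2 clips of 100 audio frames -> per-frame 3D joints, then a
# hypothetical head crop scored by its dedicated discriminator.
skeleton_net = AudioToSkeleton()
joints = skeleton_net(torch.randn(2, 100, 128)).view(2, 100, 54, 3)
head_disc = PartDiscriminator()
scores = head_disc(torch.randn(2, 3, 64, 64))  # crops would come from joints
```

In training, one would presumably pair a full-frame discriminator with one such part discriminator per attended region, so that artifacts in small but salient areas like the mouth and fingers receive their own adversarial signal.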