Paper Title

Neural Face Models for Example-Based Visual Speech Synthesis

Paper Authors

Wolfgang Paier, Anna Hilsmann, Peter Eisert

Paper Abstract

Creating realistic animations of human faces with computer graphic models is still a challenging task. It is often solved either with tedious manual work or with motion-capture-based techniques that require specialised and costly hardware. Example-based animation approaches circumvent these problems by re-using captured data of real people. This data is split into short motion samples that can be looped or concatenated in order to create novel motion sequences. The obvious advantages of this approach are the simplicity of use and the high realism, since the data exhibits only real deformations. Rather than tuning weights of a complex face rig, the animation task is performed on a higher level by arranging typical motion samples such that the desired facial performance is achieved. Two difficulties with example-based approaches, however, are high memory requirements as well as the creation of artefact-free and realistic transitions between motion samples. We solve these problems by combining the realism and simplicity of example-based animations with the advantages of neural face models. Our neural face model is capable of synthesising high-quality 3D face geometry and texture according to a compact latent parameter vector. This latent representation reduces memory requirements by a factor of 100 and helps create seamless transitions between concatenated motion samples. In this paper, we present a marker-less approach for facial motion capture based on multi-view video. Based on the captured data, we learn a neural representation of facial expressions, which is used to seamlessly concatenate facial performances during the animation procedure. We demonstrate the effectiveness of our approach by synthesising mouthings for Swiss-German sign language based on viseme query sequences.
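
The central mechanism the abstract describes, stitching motion samples seamlessly in the compact latent space of the neural face model, can be pictured with a short sketch. The Python code below is a minimal illustration under assumed shapes and names: `blend_latent_transition`, `smoothstep`, and the commented-out `decoder` are hypothetical helpers for exposition, not the paper's actual implementation.

```python
import numpy as np

def smoothstep(t):
    """Cubic ease curve in [0, 1], used as a gradual blend weight."""
    return t * t * (3.0 - 2.0 * t)

def blend_latent_transition(sample_a, sample_b, overlap):
    """Concatenate two latent-space motion samples with a seamless transition.

    sample_a, sample_b: arrays of shape (frames, latent_dim), i.e. latent
    trajectories obtained by encoding two captured motion samples.
    overlap: number of frames to cross-fade at the seam.

    Because the latent space is compact and smooth, interpolating latent
    codes yields plausible intermediate faces, which is what makes a simple
    cross-fade viable here (unlike blending raw meshes or textures).
    """
    head = sample_a[:-overlap]
    tail = sample_b[overlap:]
    # Cross-fade the overlapping frames in latent space.
    w = smoothstep(np.linspace(0.0, 1.0, overlap))[:, None]
    seam = (1.0 - w) * sample_a[-overlap:] + w * sample_b[:overlap]
    return np.concatenate([head, seam, tail], axis=0)

# Hypothetical usage: decode the blended trajectory frame by frame.
# `decoder` stands in for the trained neural face model (not shown),
# mapping a latent vector to 3D face geometry and texture:
# for z in blend_latent_transition(z_a, z_b, overlap=10):
#     geometry, texture = decoder(z)
```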

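Likewise, the viseme-driven synthesis of mouthings can be pictured as picking one captured motion sample per queried viseme and stitching the picks at the seams. The sketch below reuses the hypothetical `blend_latent_transition` from the previous block; the function name, the greedy candidate selection, and the `sample_db` layout are assumptions for illustration, not the paper's method.

```python
import numpy as np

def synthesise_mouthing(viseme_query, sample_db, overlap=8):
    """Assemble a latent trajectory for a viseme query sequence.

    viseme_query: list of viseme labels, e.g. ["sil", "a", "m", "o"].
    sample_db: dict mapping each viseme label to a list of candidate
    latent-space motion samples, each of shape (frames, latent_dim).

    For each queried viseme, pick the candidate whose first frame is
    closest (in latent space) to the end of the trajectory so far,
    then cross-fade at the seam.
    """
    trajectory = None
    for viseme in viseme_query:
        candidates = sample_db[viseme]
        if trajectory is None:
            trajectory = candidates[0].copy()
            continue
        # Greedy selection: minimise the latent-space jump at the seam.
        last = trajectory[-1]
        best = min(candidates, key=lambda s: np.linalg.norm(s[0] - last))
        trajectory = blend_latent_transition(trajectory, best, overlap)
    return trajectory
```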