Paper Title

Jointly Harnessing Prior Structures and Temporal Consistency for Sign Language Video Generation

Authors

Yucheng Suo, Zhedong Zheng, Xiaohan Wang, Bang Zhang, Yi Yang

Abstract

Sign language is a window through which differently-abled people express their feelings and emotions. However, learning sign language in a short time remains challenging. To address this real-world challenge, in this work we study a motion transfer system that can transfer a user photo into a sign language video of specific words. In particular, the appearance content of the output video comes from the provided user image, while the motion of the video is extracted from the specified tutorial video. We observe two primary limitations in adopting state-of-the-art motion transfer methods for sign language generation: (1) Existing motion transfer works ignore the prior geometric knowledge of the human body. (2) Previous image animation methods take only image pairs as input in the training stage, which cannot fully exploit the temporal information within videos. To address these limitations, we propose the Structure-aware Temporal Consistency Network (STCNet) to jointly optimize the prior structure of the human body and the temporal consistency for sign language video generation. This paper makes two main contributions. (1) We harness a fine-grained skeleton detector to provide prior knowledge of the body keypoints. In this way, we ensure that keypoint movements stay within a valid range and make the model more explainable and robust. (2) We introduce two cycle-consistency losses, i.e., a short-term cycle loss and a long-term cycle loss, to ensure the continuity of the generated video. We optimize the two losses and the keypoint detector network in an end-to-end manner.
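
The abstract does not detail how the two cycle-consistency losses are computed. The sketch below is one plausible reading, not the paper's actual implementation: it assumes a generic motion-transfer generator G(source, driving) that animates a source frame with a driving frame's motion, and an L1 reconstruction penalty. The function name, tensor layout, and frame-sampling scheme are all illustrative assumptions.

```python
# Hedged sketch of short-term and long-term cycle-consistency losses.
# G(source, driving) -> generated frame is a hypothetical motion-transfer
# generator; nothing here is taken from the STCNet codebase.
import torch
import torch.nn.functional as F

def cycle_consistency_losses(G, frames):
    """frames: (T, C, H, W) tensor holding one training video clip, T >= 2."""
    T = frames.shape[0]

    # Short-term cycle: drive frame t with its neighbor t+1, then drive
    # the result back with frame t; the round trip should recover frame t.
    t = torch.randint(0, T - 1, (1,)).item()
    fwd = G(frames[t:t + 1], frames[t + 1:t + 2])
    bwd = G(fwd, frames[t:t + 1])
    short_term = F.l1_loss(bwd, frames[t:t + 1])

    # Long-term cycle: the same round trip, but with a driving frame sampled
    # far away in time, penalizing drift across large motion gaps.
    s = torch.randint(0, T, (1,)).item()
    d = (s + T // 2) % T
    fwd = G(frames[s:s + 1], frames[d:d + 1])
    bwd = G(fwd, frames[s:s + 1])
    long_term = F.l1_loss(bwd, frames[s:s + 1])

    return short_term, long_term
```

Under this reading, the short-term loss enforces frame-to-frame smoothness while the long-term loss keeps appearance and pose consistent across distant frames; both terms can be summed with the usual reconstruction objective and backpropagated through the keypoint detector, matching the end-to-end training the abstract describes.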
