Paper Title

SVTS: Scalable Video-to-Speech Synthesis

Paper Authors

Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller, Maja Pantic

Paper Abstract

Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contemporary video-to-speech works focus mainly on small- to medium-sized corpora with substantial constraints in both vocabulary and setting. In this work, we introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio. We achieve state-of-the-art results for GRID and considerably outperform previous approaches on LRW. More importantly, by focusing on spectrogram prediction using a simple feedforward model, we can efficiently and effectively scale our method to very large and unconstrained datasets: to the best of our knowledge, we are the first to show intelligible results on the challenging LRS3 dataset.
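The abstract describes a two-stage pipeline: a video-to-spectrogram predictor followed by a pre-trained neural vocoder that turns mel spectrograms into waveforms. The sketch below is only a minimal illustration of that structure in PyTorch, not the authors' implementation: the module names, the frame dimension, the 4x video-to-audio frame ratio, and the vocoder call signature are all assumptions made here for clarity.

```python
# Minimal sketch of the two-stage video-to-speech pipeline described in the
# abstract. All names, shapes, and the vocoder interface are illustrative
# assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class VideoToSpectrogram(nn.Module):
    """Hypothetical feedforward predictor: silent lip-region frames -> mel spectrogram."""

    def __init__(self, frame_dim=88 * 88, hidden_dim=512, n_mels=80, upsample=4):
        super().__init__()
        self.upsample = upsample  # assumed number of mel frames per video frame
        self.net = nn.Sequential(
            nn.Linear(frame_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_mels * upsample),
        )

    def forward(self, frames):
        # frames: (batch, video_frames, frame_dim)
        b, t, _ = frames.shape
        mels = self.net(frames)                      # (batch, video_frames, n_mels * upsample)
        return mels.view(b, t * self.upsample, -1)   # (batch, mel_frames, n_mels)


def synthesise(frames, predictor, vocoder):
    """Predict a mel spectrogram from silent video, then invert it to audio
    with a pre-trained neural vocoder (any mel-to-waveform model; its exact
    API is assumed here to be a simple callable)."""
    with torch.no_grad():
        mel = predictor(frames)   # (batch, mel_frames, n_mels)
        return vocoder(mel)       # (batch, samples)
```

Because the vocoder is pre-trained and fixed, only the spectrogram predictor needs to be learned from the silent video, which is what makes scaling to large, unconstrained corpora such as LRS3 tractable in the paper's framing.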
