Paper Title
StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization
Paper Authors
Paper Abstract
In recent years, neural vocoders have surpassed classical speech generation approaches in naturalness and perceptual quality of the synthesized speech. Computationally heavy models like WaveNet and WaveGlow achieve best results, while lightweight GAN models, e.g. MelGAN and Parallel WaveGAN, remain inferior in terms of perceptual quality. We therefore propose StyleMelGAN, a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity. StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech. For efficient training, multiple random-window discriminators adversarially evaluate the speech signal analyzed by a filter bank, with regularization provided by a multi-scale spectral reconstruction loss. The highly parallelizable speech generation is several times faster than real-time on CPUs and GPUs. MUSHRA and P.800 listening tests show that StyleMelGAN outperforms prior neural vocoders in copy-synthesis and Text-to-Speech scenarios.
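For illustration, the temporal adaptive normalization mentioned in the abstract can be thought of as a 1-D analogue of spatially adaptive normalization: the conditioning acoustic features (e.g. a mel-spectrogram) predict time-varying scale and shift maps that "style" the normalized activations derived from the low-dimensional noise vector. The PyTorch sketch below illustrates this idea only; the module name `TADELayer`, the hidden width, and the kernel sizes are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TADELayer(nn.Module):
    """Sketch of temporal adaptive normalization: normalize the activation
    over time, then modulate it with per-channel, per-timestep scale and
    shift maps predicted from the acoustic features. Hidden width, kernel
    sizes, and layer names are illustrative assumptions."""

    def __init__(self, channels: int, feat_channels: int, hidden: int = 64):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.feat_conv = nn.Conv1d(feat_channels, hidden, kernel_size=3, padding=1)
        self.gamma_conv = nn.Conv1d(hidden, channels, kernel_size=3, padding=1)
        self.beta_conv = nn.Conv1d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T) activation derived from the noise vector
        # features: (batch, feat_channels, T_feat) acoustic features
        features = F.interpolate(features, size=x.shape[-1], mode="nearest")
        h = F.leaky_relu(self.feat_conv(features), 0.2)
        gamma = self.gamma_conv(h)  # time-varying scale
        beta = self.beta_conv(h)    # time-varying shift
        return self.norm(x) * gamma + beta

# Toy usage: style a random activation with hypothetical 80-band mel features.
x = torch.randn(1, 64, 8000)
mel = torch.randn(1, 80, 40)
y = TADELayer(channels=64, feat_channels=80)(x, mel)  # -> (1, 64, 8000)
```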
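The multi-scale spectral reconstruction loss used for regularization is commonly realized as a sum of spectral-convergence and log-magnitude L1 terms computed over STFTs at several resolutions. The sketch below assumes that standard formulation; the function names and the FFT sizes, hop lengths, and window lengths in `resolutions` are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn.functional as F

def stft_magnitude(x: torch.Tensor, n_fft: int, hop: int, win: int) -> torch.Tensor:
    """STFT magnitude of a batch of waveforms x with shape (batch, samples)."""
    window = torch.hann_window(win, device=x.device)
    return torch.stft(x, n_fft, hop, win, window=window, return_complex=True).abs()

def multi_scale_spectral_loss(pred: torch.Tensor, target: torch.Tensor,
                              resolutions=((512, 128, 512),
                                           (1024, 256, 1024),
                                           (2048, 512, 2048))) -> torch.Tensor:
    """Average of spectral-convergence and log-magnitude L1 terms over
    several STFT resolutions (placeholder settings, not the paper's)."""
    total = 0.0
    for n_fft, hop, win in resolutions:
        p = stft_magnitude(pred, n_fft, hop, win)
        t = stft_magnitude(target, n_fft, hop, win)
        sc = torch.linalg.norm(t - p) / torch.linalg.norm(t)       # spectral convergence
        mag = F.l1_loss(torch.log(p + 1e-7), torch.log(t + 1e-7))  # log-magnitude L1
        total = total + sc + mag
    return total / len(resolutions)

# Toy usage with 1 s of 22.05 kHz audio.
loss = multi_scale_spectral_loss(torch.randn(2, 22050), torch.randn(2, 22050))
```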