Fast Griffin Lim based Waveform Generation Strategy for Text-to-Speech Synthesis

Paper Title

Fast Griffin Lim based Waveform Generation Strategy for Text-to-Speech Synthesis

Authors

Ankit Sharma, Puneet Kumar, Vikas Maddukuri, Nagasai Madamshetti, Kishore KG, Sahit Sai Sriram Kavuru, Balasubramanian Raman, Partha Pratim Roy

Abstract

The performance of text-to-speech (TTS) systems depends heavily on spectrogram-to-waveform generation, also known as the speech reconstruction phase. The time this phase takes is known as the synthesis delay. This paper proposes an approach to reduce speech synthesis delay, aimed at TTS systems for real-time applications such as digital assistants, mobile phones, and embedded devices. The proposed approach uses the Fast Griffin Lim Algorithm (FGLA) instead of the Griffin Lim Algorithm (GLA) as the vocoder in the speech synthesis phase. GLA and FGLA are both iterative, but FGLA converges faster than GLA. The approach is tested on the LJSpeech, Blizzard, and Tatoeba datasets, and the FGLA results are compared against GLA and a neural Generative Adversarial Network (GAN) based vocoder. Performance is evaluated in terms of synthesis delay and speech quality. A 36.58% reduction in speech synthesis delay is observed. The quality of the output speech also improves, as supported by higher Mean Opinion Scores (MOS) and the faster convergence of FGLA compared with GLA.
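
The difference between GLA and FGLA mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the unpadded framing, the Hann window, and the momentum value `alpha=0.99` (a value commonly used for FGLA in the literature) are all assumptions made for illustration. Both algorithms alternate between enforcing the target magnitude and enforcing STFT consistency; FGLA adds a momentum term to each update, and setting `alpha=0` reduces the sketch to plain GLA. The `length` argument is assumed to equal `n_fft + hop * (num_frames - 1)` so that frame counts match.

```python
import numpy as np


def stft(x, n_fft, hop, window):
    """Windowed frames of x, one real FFT per frame (no padding)."""
    starts = range(0, len(x) - n_fft + 1, hop)
    frames = np.array([x[s:s + n_fft] * window for s in starts])
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft, hop, window, length):
    """Inverse STFT via windowed overlap-add, normalised by the window energy."""
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    signal = np.zeros(length)
    norm = np.zeros(length)
    for k, frame in enumerate(frames):
        signal[k * hop:k * hop + n_fft] += frame
        norm[k * hop:k * hop + n_fft] += window ** 2
    # Guard samples whose accumulated window weight is (near) zero.
    return signal / np.maximum(norm, 1e-8)


def fast_griffin_lim(mag, n_fft, hop, window, length,
                     n_iter=60, alpha=0.99, seed=0):
    """Estimate a waveform from an STFT magnitude `mag`.

    alpha=0 gives plain Griffin Lim; alpha>0 adds the FGLA momentum term.
    """
    rng = np.random.default_rng(seed)
    c = mag * np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    t_prev = None
    for _ in range(n_iter):
        # Magnitude projection: keep the current phase, force the target magnitude.
        proj = mag * np.exp(1j * np.angle(c))
        # Consistency projection: go to the time domain and back.
        t = stft(istft(proj, n_fft, hop, window, length), n_fft, hop, window)
        # Momentum step; with alpha=0 this is the ordinary GLA update.
        c = t if t_prev is None else t + alpha * (t - t_prev)
        t_prev = t
    return istft(mag * np.exp(1j * np.angle(c)), n_fft, hop, window, length)


def spectral_error(y, mag, n_fft, hop, window):
    """Relative distance between the STFT magnitude of y and the target."""
    est = np.abs(stft(y, n_fft, hop, window))
    return np.linalg.norm(est - mag) / np.linalg.norm(mag)
```

To compare the two variants informally, one could run `fast_griffin_lim` on the same magnitude spectrogram with `alpha=0` and `alpha=0.99` for an equal number of iterations and inspect `spectral_error` for each result. Synthesis delay, as measured in the paper, also depends on the implementation and hardware, which this sketch does not model.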
