Paper title
Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis
Paper authors
Paper abstract
We explore pretraining strategies, including the choice of base corpus, with the aim of identifying the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine the choice of neural vocoder for waveform synthesis, as well as the acoustic configurations used for mel spectrograms and final audio output. We find that fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve the naturalness of synthetic speech and its similarity to unseen target speakers. Additionally, we find that listeners can discern between 16 kHz and 24 kHz sampling rates, and that WaveRNN produces output waveforms of comparable quality to WaveNet with a faster inference time.
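To make the sampling-rate comparison concrete, the sketch below shows how the two rates named in the abstract change the framing of a mel spectrogram and the highest frequency the audio can represent. The `AcousticConfig` class is hypothetical, and the 12.5 ms frame shift and 50 ms window are illustrative assumptions, not values taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcousticConfig:
    """Hypothetical container for mel-spectrogram framing settings."""
    sample_rate_hz: int
    hop_ms: float = 12.5  # assumed frame shift, not from the paper
    win_ms: float = 50.0  # assumed window length, not from the paper

    @property
    def hop_length(self) -> int:
        # Frame shift expressed in samples at this sampling rate.
        return round(self.sample_rate_hz * self.hop_ms / 1000)

    @property
    def win_length(self) -> int:
        # Analysis window expressed in samples at this sampling rate.
        return round(self.sample_rate_hz * self.win_ms / 1000)

    @property
    def nyquist_hz(self) -> float:
        # Highest frequency representable at this sampling rate.
        return self.sample_rate_hz / 2


# The two sampling rates compared in the abstract.
configs = {name: AcousticConfig(sr) for name, sr in [("16k", 16000), ("24k", 24000)]}

for name, cfg in configs.items():
    print(name, cfg.hop_length, cfg.win_length, cfg.nyquist_hz)
```

Under these assumed frame settings, the 24 kHz configuration uses more samples per frame and extends the representable band from 8 kHz to 12 kHz, which is one plausible reason listeners could tell the two rates apart.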