Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis

Paper Title

Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis

Authors

Cooper, Erica; Wang, Xin; Zhao, Yi; Yasuda, Yusuke; Yamagishi, Junichi

Abstract

We explore pretraining strategies including choice of base corpus with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine choice of neural vocoder for waveform synthesis, as well as acoustic configurations used for mel spectrograms and final audio output. We find that fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve naturalness and similarity to unseen target speakers of synthetic speech. Additionally, we find that listeners can discern between a 16kHz and 24kHz sampling rate, and that WaveRNN produces output waveforms of a comparable quality to WaveNet, with a faster inference time.
