Paper Title

R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS

Authors

Kyle Kastner, Aaron Courville

Abstract

This paper introduces R-MelNet, a two-part autoregressive architecture with a frontend based on the first tier of MelNet and a backend WaveRNN-style audio decoder for neural text-to-speech synthesis. Taking as input a mixed sequence of characters and phonemes, with an optional audio priming sequence, this model produces low-resolution mel-spectral features which are interpolated and used by a WaveRNN decoder to produce an audio waveform. Coupled with half precision training, R-MelNet uses under 11 gigabytes of GPU memory on a single commodity GPU (NVIDIA 2080Ti). We detail a number of critical implementation details for stable half precision training, including an approximate, numerically stable mixture of logistics attention. Using a stochastic, multi-sample per step inference scheme, the resulting model generates highly varied audio, while enabling text and audio based controls to modify output waveforms. Qualitative and quantitative evaluations of an R-MelNet system trained on a single speaker TTS dataset demonstrate the effectiveness of our approach.
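To illustrate the attention mechanism the abstract refers to, the sketch below shows the standard mixture-of-logistics (logistic-CDF) alignment computation, where the weight for each text position is obtained as a difference of logistic CDFs evaluated at the position boundaries. This is a minimal NumPy sketch under assumed conventions, not the paper's exact approximation; the function name `mol_attention_weights`, the scale floor `eps`, and the toy parameter values are illustrative.

```python
import numpy as np

def stable_sigmoid(x):
    # Piecewise logistic sigmoid that never exponentiates a large positive
    # argument, avoiding overflow; the kind of detail that matters for
    # half-precision stability (this sketch runs in float64 for clarity).
    out = np.empty_like(x, dtype=np.float64)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    out[~pos] = np.exp(x[~pos]) / (1.0 + np.exp(x[~pos]))
    return out

def mol_attention_weights(pi, mu, scale, num_positions, eps=1e-4):
    # pi:    (K,) mixture weights, assumed softmax-normalized
    # mu:    (K,) component means over text positions
    # scale: (K,) component scales; floored at eps for numerical stability
    # Returns a (num_positions,) alignment vector whose entry j is the mass
    # the mixture assigns to position j, i.e. CDF(j + 0.5) - CDF(j - 0.5).
    scale = np.maximum(scale, eps)
    j = np.arange(num_positions, dtype=np.float64)
    upper = stable_sigmoid((j[None, :] + 0.5 - mu[:, None]) / scale[:, None])
    lower = stable_sigmoid((j[None, :] - 0.5 - mu[:, None]) / scale[:, None])
    weights = (pi[:, None] * (upper - lower)).sum(axis=0)
    return weights / np.maximum(weights.sum(), eps)

# Toy usage: two components attending over a 10-symbol character/phoneme sequence.
pi = np.array([0.7, 0.3])
mu = np.array([3.0, 3.5])
scale = np.array([1.0, 0.5])
print(mol_attention_weights(pi, mu, scale, num_positions=10))
```

The scale floor and the piecewise sigmoid are examples of the stability concerns the abstract highlights for half-precision training; the paper's own approximation may differ in its details.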
