Paper Title

R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS

Authors

Kyle Kastner, Aaron Courville

Abstract

This paper introduces R-MelNet, a two-part autoregressive architecture with a frontend based on the first tier of MelNet and a backend WaveRNN-style audio decoder for neural text-to-speech synthesis. Taking as input a mixed sequence of characters and phonemes, with an optional audio priming sequence, this model produces low-resolution mel-spectral features which are interpolated and used by a WaveRNN decoder to produce an audio waveform. Coupled with half precision training, R-MelNet uses under 11 gigabytes of GPU memory on a single commodity GPU (NVIDIA 2080Ti). We detail a number of critical implementation details for stable half precision training, including an approximate, numerically stable mixture of logistics attention. Using a stochastic, multi-sample per step inference scheme, the resulting model generates highly varied audio, while enabling text and audio based controls to modify output waveforms. Qualitative and quantitative evaluations of an R-MelNet system trained on a single speaker TTS dataset demonstrate the effectiveness of our approach.
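To illustrate the attention mechanism the abstract refers to, the sketch below shows the standard mixture-of-logistics (logistic-CDF) alignment computation, where the weight for each text position is obtained as a difference of logistic CDFs evaluated at the position boundaries. This is a minimal NumPy sketch under assumed conventions, not the paper's exact approximation; the function name `mol_attention_weights`, the scale floor `eps`, and the toy parameter values are illustrative.

```python
import numpy as np

def stable_sigmoid(x):
    # Piecewise logistic sigmoid that never exponentiates a large positive
    # argument, avoiding overflow; the kind of detail that matters for
    # half-precision stability (this sketch runs in float64 for clarity).
    out = np.empty_like(x, dtype=np.float64)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    out[~pos] = np.exp(x[~pos]) / (1.0 + np.exp(x[~pos]))
    return out

def mol_attention_weights(pi, mu, scale, num_positions, eps=1e-4):
    # pi:    (K,) mixture weights, assumed softmax-normalized
    # mu:    (K,) component means over text positions
    # scale: (K,) component scales; floored at eps for numerical stability
    # Returns a (num_positions,) alignment vector whose entry j is the mass
    # the mixture assigns to position j, i.e. CDF(j + 0.5) - CDF(j - 0.5).
    scale = np.maximum(scale, eps)
    j = np.arange(num_positions, dtype=np.float64)
    upper = stable_sigmoid((j[None, :] + 0.5 - mu[:, None]) / scale[:, None])
    lower = stable_sigmoid((j[None, :] - 0.5 - mu[:, None]) / scale[:, None])
    weights = (pi[:, None] * (upper - lower)).sum(axis=0)
    return weights / np.maximum(weights.sum(), eps)

# Toy usage: two components attending over a 10-symbol character/phoneme sequence.
pi = np.array([0.7, 0.3])
mu = np.array([3.0, 3.5])
scale = np.array([1.0, 0.5])
print(mol_attention_weights(pi, mu, scale, num_positions=10))
```

The scale floor and the piecewise sigmoid are examples of the stability concerns the abstract highlights for half-precision training; the paper's own approximation may differ in its details.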
