XiaoiceSing 2: A High-Fidelity Singing Voice Synthesizer Based on Generative Adversarial Network

Paper Title

XiaoiceSing 2: A High-Fidelity Singing Voice Synthesizer Based on Generative Adversarial Network

Authors

Chunhui Wang, Chang Zeng, Xing He

Abstract

XiaoiceSing is a singing voice synthesis (SVS) system that aims to generate 48kHz singing voices. However, the mel-spectrogram it generates is over-smoothed in the middle- and high-frequency regions because it has no dedicated design for modeling the details of these parts. In this paper, we propose XiaoiceSing2, which can generate the details of the middle- and high-frequency parts to better construct the full-band mel-spectrogram. Specifically, to alleviate this problem, XiaoiceSing2 adopts a generative adversarial network (GAN) consisting of a FastSpeech-based generator and a multi-band discriminator. We improve the feed-forward Transformer (FFT) block by adding multiple residual convolutional blocks in parallel with the self-attention block to balance local and global features. The multi-band discriminator contains three sub-discriminators responsible for the low-, middle-, and high-frequency parts of the mel-spectrogram, respectively. Each sub-discriminator is composed of several segment discriminators (SD) and detail discriminators (DD) to distinguish the audio from different aspects. Experiments on our internal 48kHz singing voice dataset show that XiaoiceSing2 significantly improves the quality of the singing voice over XiaoiceSing.
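
To make the multi-band idea concrete, here is a minimal sketch of how a mel-spectrogram might be split into low-, middle-, and high-frequency parts before each part is passed to its own sub-discriminator. The abstract does not say where the band boundaries lie, so the equal split along the frequency axis, the `split_bands` helper, the `mel[bin][frame]` layout, and the 80-bin toy input are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List

# Assumed layout: mel[bin][frame], with bin 0 as the lowest frequency.
Mel = List[List[float]]


def split_bands(mel: Mel, n_bands: int = 3) -> List[Mel]:
    """Split a mel-spectrogram into contiguous bands of near-equal size."""
    n_bins = len(mel)
    if n_bins < n_bands:
        raise ValueError("need at least one mel bin per band")
    # Hypothetical boundaries: an equal split along the frequency axis.
    edges = [round(i * n_bins / n_bands) for i in range(n_bands + 1)]
    return [mel[lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]


# Toy input: 80 mel bins x 4 frames, each bin filled with its own index.
mel = [[float(b)] * 4 for b in range(80)]
low, mid, high = split_bands(mel)
```

In a real system each of `low`, `mid`, and `high` would be a tensor handed to a separate sub-discriminator; plain lists are used here only to keep the sketch dependency-free.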
