Paper Title

VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

Authors

Yongmao Zhang, Heyang Xue, Hanzhao Li, Lei Xie, Tingwei Guo, Ruixiong Zhang, Caixia Gong

Abstract

The end-to-end singing voice synthesis (SVS) model VISinger achieves better performance than typical two-stage models while using fewer parameters. However, VISinger has several problems: (1) the text-to-phase problem: the end-to-end model learns a meaningless mapping from text to phase; (2) the glitch problem: the harmonic components corresponding to the periodic signal of voiced segments change abruptly, producing audible artefacts; and (3) a low sampling rate: its 24 kHz sampling rate does not meet the needs of high-fidelity applications that require full-band audio (44.1 kHz or higher). In this paper, we propose VISinger 2 to address these issues by integrating digital signal processing (DSP) methods with VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP), we incorporate a DSP synthesizer into the decoder. The DSP synthesizer consists of a harmonic synthesizer and a noise synthesizer, which generate periodic and aperiodic signals, respectively, from the latent representation z in VISinger. This supervises the posterior encoder to extract a latent representation without phase information and spares the prior encoder from modelling a text-to-phase mapping. To avoid glitch artefacts, HiFi-GAN is modified to accept the waveforms generated by the DSP synthesizer as a condition for producing the singing voice. Moreover, with the improved waveform decoder, VISinger 2 generates 44.1 kHz singing audio with richer expression and better quality. Experiments on the OpenCpop corpus show that VISinger 2 outperforms VISinger, CpopSing and RefineSinger in both subjective and objective metrics.
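
The harmonic-plus-noise idea behind the DSP synthesizer can be illustrated with a minimal sketch. This is not the paper's implementation: in VISinger 2 the synthesizer parameters are predicted from the latent representation z, whereas here they are fixed inputs. The function names (`harmonic_synth`, `noise_synth`, `dsp_synth`) and all parameter values are hypothetical, and the noise branch is plain scaled white noise rather than a filtered-noise model.

```python
import math
import random


def harmonic_synth(f0_hz, harmonic_amps, sample_rate):
    """Sum of sinusoids at integer multiples of a per-sample f0 (periodic part)."""
    nyquist = sample_rate / 2.0
    phase = 0.0
    out = []
    for f0 in f0_hz:
        # Accumulate phase so that f0 can vary over time without discontinuities.
        phase += 2.0 * math.pi * f0 / sample_rate
        sample = 0.0
        for k, amp in enumerate(harmonic_amps, start=1):
            # Skip unvoiced samples (f0 == 0) and harmonics at or above Nyquist.
            if f0 > 0.0 and k * f0 < nyquist:
                sample += amp * math.sin(k * phase)
        out.append(sample)
    return out


def noise_synth(gains, seed=0):
    """White noise scaled by a per-sample gain (aperiodic part, unfiltered)."""
    rng = random.Random(seed)
    return [g * rng.uniform(-1.0, 1.0) for g in gains]


def dsp_synth(f0_hz, harmonic_amps, noise_gains, sample_rate=44100):
    """Add the periodic and aperiodic components sample by sample."""
    periodic = harmonic_synth(f0_hz, harmonic_amps, sample_rate)
    aperiodic = noise_synth(noise_gains)
    return [p + a for p, a in zip(periodic, aperiodic)]


# 10 ms of a 220 Hz voiced segment with three harmonics and light noise.
n = 441
wave = dsp_synth([220.0] * n, [0.5, 0.3, 0.2], [0.01] * n)
```

Because the sinusoid phases are generated deterministically from f0, a waveform like `wave` carries the periodic structure explicitly, which is the property the abstract relies on when it conditions the modified HiFi-GAN on the DSP synthesizer output.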
