VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

Paper title

VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

Authors

Yongmao Zhang, Heyang Xue, Hanzhao Li, Lei Xie, Tingwei Guo, Ruixiong Zhang, Caixia Gong

Abstract

The end-to-end singing voice synthesis (SVS) model VISinger can achieve better performance than typical two-stage models while using fewer parameters. However, VISinger has several problems: a text-to-phase problem, in which the end-to-end model learns a meaningless mapping from text to phase; a glitch problem, in which the harmonic components corresponding to the periodic signal of voiced segments change abruptly, producing audible artefacts; and a low sampling rate, since 24 kHz does not meet the needs of high-fidelity applications that require the full-band rate (44.1 kHz or higher). In this paper, we propose VISinger 2 to address these issues by integrating digital signal processing (DSP) methods with VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP), we incorporate a DSP synthesizer into the decoder. The DSP synthesizer consists of a harmonic synthesizer and a noise synthesizer, which generate periodic and aperiodic signals, respectively, from the latent representation z in VISinger. It supervises the posterior encoder to extract a latent representation without phase information and prevents the prior encoder from modelling a text-to-phase mapping. To avoid glitch artefacts, HiFi-GAN is modified to accept the waveforms generated by the DSP synthesizer as a condition for producing the singing voice. Moreover, with the improved waveform decoder, VISinger 2 generates 44.1 kHz singing audio with richer expression and better quality. Experiments on the OpenCpop corpus show that VISinger 2 outperforms VISinger, CpopSing and RefineSinger in both subjective and objective metrics.
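The harmonic-plus-noise idea in the abstract can be illustrated with a small standalone sketch. It is not the authors' implementation: in VISinger 2 the synthesizer controls are predicted from the latent representation z by a neural network, whereas here the fundamental frequency, the harmonic amplitudes and the noise gain are hard-coded assumptions, and the noise branch is plain scaled white noise rather than a learned filter. All function and parameter names (`harmonic_synth`, `noise_synth`, `dsp_synth`, `noise_gain`) are hypothetical.

```python
import math
import random

SAMPLE_RATE = 44100  # full-band rate mentioned in the abstract


def harmonic_synth(f0, amplitudes, sample_rate=SAMPLE_RATE):
    """Periodic branch: sum of sinusoids at integer multiples of f0.

    f0: per-sample fundamental frequency in Hz; 0 marks unvoiced samples.
    amplitudes: one amplitude per harmonic (held constant here for brevity).
    """
    out = []
    phase = 0.0  # running phase of the fundamental, in radians
    nyquist = sample_rate / 2
    for f in f0:
        # Accumulating phase keeps the waveform continuous when f0 changes.
        phase += 2 * math.pi * f / sample_rate
        sample = 0.0
        if f > 0:
            for k, a in enumerate(amplitudes, start=1):
                if k * f < nyquist:  # skip harmonics that would alias
                    sample += a * math.sin(k * phase)
        out.append(sample)
    return out


def noise_synth(n_samples, gain, seed=0):
    """Aperiodic branch: white noise scaled by a constant gain (assumption)."""
    rng = random.Random(seed)
    return [gain * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def dsp_synth(f0, amplitudes, noise_gain):
    """Add the periodic and aperiodic branches sample by sample."""
    periodic = harmonic_synth(f0, amplitudes)
    aperiodic = noise_synth(len(f0), noise_gain)
    return [p + n for p, n in zip(periodic, aperiodic)]


# 10 ms of a 220 Hz tone with three harmonics plus low-level noise.
f0 = [220.0] * 441
wave = dsp_synth(f0, amplitudes=[0.5, 0.25, 0.125], noise_gain=0.01)
```

Because the sinusoid phases are produced deterministically from f0, a signal of this kind carries no phase that has to be predicted from text. This is the property the abstract relies on when it describes the DSP synthesizer as steering the latent representation away from phase information.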
