Learn2Sing: Target Speaker Singing Voice Synthesis by Learning from a Singing Teacher

Paper title

Learn2Sing: Target Speaker Singing Voice Synthesis by Learning from a Singing Teacher

Authors

Heyang Xue, Shan Yang, Yi Lei, Lei Xie, Xiulin Li

Abstract

Singing voice synthesis has attracted growing attention with the rapid development of speech synthesis. In general, a studio-level singing corpus is needed to produce a natural singing voice from lyrics and music-related transcription. However, such a corpus is difficult to collect, since most people cannot sing like a professional singer. In this paper, we propose Learn2Sing, an approach that needs only a singing teacher to generate a target speaker's singing voice without any singing data from that speaker. In our approach, a teacher's singing corpus and speech from multiple target speakers are used together to train a frame-level auto-regressive acoustic model, in which singing and speaking share a common speaker embedding and style tag embedding. Because there is no music-related transcription for the target speakers, we use log-scale fundamental frequency (LF0) as an auxiliary input feature of the acoustic model to build a unified input representation. To enable the target speaker to sing without singing reference audio at inference time, a duration model and an LF0 prediction model are also trained. In particular, we employ domain adversarial training (DAT) in the acoustic model, which aims to enhance the singing performance of target speakers by disentangling style from the acoustic features of singing and speaking data. Our experiments indicate that the proposed approach is capable of synthesizing a singing voice for a target speaker given only their speech samples.
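The unified input representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the natural-log base, the choice to map unvoiced frames to 0.0, and the per-frame dictionary layout are all assumptions made for illustration. The sketch converts a frame-level F0 contour to LF0 and pairs it with speaker and style labels, so that speech and singing frames share one input layout.

```python
import math


def f0_to_lf0(f0_hz, unvoiced_value=0.0):
    """Convert a frame-level F0 contour in Hz to log-scale F0 (LF0).

    Frames with F0 <= 0 are treated as unvoiced and mapped to
    `unvoiced_value`. Both this convention and the natural-log base
    are assumptions; the abstract does not specify them.
    """
    return [math.log(f) if f > 0 else unvoiced_value for f in f0_hz]


def build_frame_inputs(phoneme_ids, lf0, speaker_id, style_tag):
    """Attach LF0 and the shared speaker/style labels to every frame.

    The dictionary layout is hypothetical; it only shows that speech
    and singing frames can be described by the same set of fields.
    """
    if len(phoneme_ids) != len(lf0):
        raise ValueError("phoneme_ids and lf0 must have one entry per frame")
    return [
        {"phoneme": p, "lf0": v, "speaker": speaker_id, "style": style_tag}
        for p, v in zip(phoneme_ids, lf0)
    ]


# Three frames: one unvoiced, then 220 Hz and 440 Hz (one octave apart).
f0 = [0.0, 220.0, 440.0]
lf0 = f0_to_lf0(f0)
frames = build_frame_inputs([3, 7, 7], lf0, speaker_id=1, style_tag="speech")
```

In the log domain an octave is a constant offset of log 2 regardless of register, which is one common reason to model LF0 rather than raw F0.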
