Paper Title
Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher
Paper Authors
Paper Abstract
Building a high-quality singing corpus for a person who is not good at singing is non-trivial, which makes it challenging to create a singing voice synthesizer for such a person. Learn2Sing is dedicated to synthesizing the singing voice of a speaker without his or her singing data by learning from data recorded by others, i.e., a singing teacher. Inspired by the fact that pitch is the key style factor distinguishing singing from speaking voice, the proposed Learn2Sing 2.0 first generates a preliminary acoustic feature with pitch values averaged at the phone level, which allows training samples of different styles, i.e., speaking or singing, to share the same conditions except for the speaker information. Then, conditioned on the specific style, a diffusion decoder, accelerated by a fast sampling algorithm at the inference stage, gradually restores the final acoustic feature. During training, to avoid mixing the information carried by the speaker embedding and the style embedding, mutual information is employed to constrain the learning of the two embeddings. Experiments show that the proposed approach can synthesize a high-quality singing voice for the target speaker, without any singing data from that speaker, using only 10 decoding steps.
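The phone-level pitch averaging described in the abstract can be illustrated with a short sketch. This is a minimal illustration, not the paper's released code; the function name phone_level_average_pitch and the frame-aligned duration layout are assumptions made here for clarity.

import torch

def phone_level_average_pitch(f0: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Replace frame-level F0 with its per-phone mean (hypothetical helper).

    f0:        (T,) frame-level pitch values
    durations: (N,) number of frames per phone, summing to T

    Returns a (T,) tensor in which every frame belonging to a phone
    holds that phone's mean pitch, removing the fine-grained pitch
    contour that distinguishes singing from speaking.
    """
    out = torch.empty_like(f0)
    start = 0
    for d in durations.tolist():
        out[start:start + d] = f0[start:start + d].mean()
        start += d
    return out

With the contour flattened this way, speech and singing samples present near-identical pitch conditions to the first-stage model, so the two styles can share one training process as the abstract describes.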
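The mutual-information constraint on the speaker and style embeddings can likewise be sketched. The abstract does not name the estimator, so the sketch below assumes a CLUB-style variational upper bound on I(speaker; style), a common choice for this kind of disentanglement; the paper's exact estimator may differ, and the class CLUBEstimator is a name invented here.

import torch
import torch.nn as nn

class CLUBEstimator(nn.Module):
    """Variational upper bound on mutual information I(speaker; style).

    A Gaussian network q(style | speaker) yields an upper bound on MI
    that the synthesizer can minimize, pushing speaker and style
    embeddings toward carrying disjoint information.
    """

    def __init__(self, spk_dim: int, style_dim: int, hidden: int = 256):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(spk_dim, hidden), nn.ReLU(), nn.Linear(hidden, style_dim)
        )
        self.logvar = nn.Sequential(
            nn.Linear(spk_dim, hidden), nn.ReLU(), nn.Linear(hidden, style_dim)
        )

    def forward(self, spk_emb: torch.Tensor, style_emb: torch.Tensor) -> torch.Tensor:
        # Positive pairs: log q(style_i | speaker_i)
        mu, logvar = self.mu(spk_emb), self.logvar(spk_emb)
        positive = -0.5 * ((style_emb - mu) ** 2 / logvar.exp() + logvar).sum(-1)
        # Negative pairs: log q(style_j | speaker_i) via an in-batch shuffle
        shuffled = style_emb[torch.randperm(style_emb.size(0))]
        negative = -0.5 * ((shuffled - mu) ** 2 / logvar.exp() + logvar).sum(-1)
        # CLUB bound: E[log q(pos)] - E[log q(neg)] >= I(speaker; style)
        return (positive - negative).mean()

In practice such an estimator is trained in alternation with the synthesizer: the estimator network is fit to the positive pairs by maximum likelihood, while the synthesizer minimizes the returned bound as an auxiliary loss alongside its reconstruction objective.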