DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System

Paper title

DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System

Authors

Liqiang Zhang, Chengzhu Yu, Heng Lu, Chao Weng, Chunlei Zhang, Yusong Wu, Xiang Xie, Zijin Li, Dong Yu

Abstract

Singing voice conversion converts the timbre of the source singing to the target speaker's voice while keeping the singing content the same. However, singing data for a target speaker is much more difficult to collect than normal speech data. In this paper, we introduce a singing voice conversion algorithm that is capable of generating high-quality singing in the target speaker's voice using only his or her normal speech data. First, we integrate the training and conversion processes for speech and singing into one framework by unifying the features used in standard speech synthesis systems and singing synthesis systems. In this way, normal speech data can also contribute to singing voice conversion training, making the singing voice conversion system more robust, especially when the singing database is small. Moreover, in order to achieve one-shot singing voice conversion, a speaker embedding module is developed using both speech and singing data, which provides the target speaker's identity information during conversion. Experiments indicate that the proposed singing voice conversion system can convert source singing into high-quality singing in the target speaker's voice with only 20 seconds of the target speaker's enrollment speech data.
