Paper Title
Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer
Paper Authors
Abstract
In this paper, we propose a model that performs style transfer from speech to singing voice. In contrast to previous signal-processing-based methods, which require high-quality singing templates or phoneme synchronization, we explore a data-driven approach to the problem of converting natural speech to singing voice. We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody while preserving the speaker identity and naturalness. The proposed SymNet model comprises a symmetrical stack of three types of layers: convolutional, transformer, and self-attention layers. The paper also explores novel data augmentation and generative loss annealing methods to facilitate model training. Experiments are performed on the NUS and NHSS datasets, which consist of parallel speech and singing-voice data. In these experiments, we show that the proposed SymNet model significantly improves objective reconstruction quality over previously published methods and baseline architectures. Further, a subjective listening test confirms the improved quality of the audio obtained with the proposed approach (an absolute improvement of 0.37 in mean opinion score over the baseline system).
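The abstract describes SymNet as a symmetrical stack of convolutional, transformer, and self-attention layers. Since the paper's exact layer configuration and hyperparameters are not given here, the following is only a minimal, hypothetical PyTorch sketch of such a symmetrical stack: convolutional layers on the outside, transformer encoder layers next, and a self-attention layer at the center, mapping one spectrogram sequence to another of the same shape. All sizes (`n_mels`, `d_model`, `n_heads`) are assumptions for illustration, not values from the paper.

```python
# Hypothetical sketch of a SymNet-style symmetrical stack.
# Layer sizes and ordering details are assumptions, not the paper's spec.
import torch
import torch.nn as nn


class SymNetSketch(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Outer pair: convolutional layers (mirror images of each other).
        self.conv_in = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.conv_out = nn.Conv1d(d_model, n_mels, kernel_size=3, padding=1)
        # Inner pair: transformer encoder layers.
        self.tf_in = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.tf_out = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Central self-attention layer.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_mels) spectrogram of the input speech.
        h = self.conv_in(x.transpose(1, 2)).transpose(1, 2)  # to (B, T, d_model)
        h = self.tf_in(h)
        h, _ = self.attn(h, h, h)      # self-attention at the center of the stack
        h = self.tf_out(h)
        # Back to spectrogram shape (batch, frames, n_mels).
        return self.conv_out(h.transpose(1, 2)).transpose(1, 2)


x = torch.randn(2, 100, 80)            # 2 utterances, 100 frames, 80 mel bins
y = SymNetSketch()(x)
print(tuple(y.shape))                  # (2, 100, 80)
```

The mirrored ordering keeps the input and output in the spectrogram domain while the attention layers at the center can model the alignment between the spoken input and the target melody.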