Vowels and Prosody Contribution in Neural Network Based Voice Conversion Algorithm with Noisy Training Data

Paper title

Vowels and Prosody Contribution in Neural Network Based Voice Conversion Algorithm with Noisy Training Data

Author

Agbolade, Olaide

Abstract

This research presents a neural network based voice conversion (VC) model. While it is well known that voiced sounds and prosody are the most important components of the voice conversion framework, their objective contributions are not known, particularly in a noisy and uncontrolled environment. The model uses a 2-layer feedforward neural network to map the linear prediction analysis coefficients of a source speaker to the acoustic vector space of the target speaker, with a view to objectively determining the contributions of the voiced, unvoiced and supra-segmental components of sounds to the voice conversion model. Results showed that the vowels 'a', 'i' and 'o' make the most significant contribution to conversion success. The voiceless sounds were also found to be the most affected by noisy training data. An average noise level of 40 dB above the noise floor was found to degrade the voice conversion success by 55.14 percent relative to the voiced sounds. The results also show that for cross-gender voice conversion, prosody conversion is more significant in scenarios where a female is the target speaker.
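The mapping step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not give the layer sizes, activation function, training procedure or data, so everything here is an assumption. "2-layer" is read as one hidden layer plus a linear output layer, the LPC order is taken as 12, the hidden layer has 16 tanh units, and the source/target frames are synthetic random vectors standing in for time-aligned LPC coefficients. The names `init_params`, `forward`, `train` and `demo` are hypothetical.

```python
import numpy as np


def init_params(n_in, n_hidden, n_out, rng):
    """Small random weights for a one-hidden-layer network (assumed architecture)."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def forward(params, x):
    """Map source frames x (n_frames, n_in) to predicted target frames."""
    h = np.tanh(x @ params["W1"] + params["b1"])  # hidden layer (tanh is an assumption)
    y = h @ params["W2"] + params["b2"]           # linear output layer
    return h, y


def train(params, x, t, lr=0.05, epochs=500):
    """Full-batch gradient descent on 0.5 * mean-over-frames squared error."""
    n = x.shape[0]
    losses = []
    for _ in range(epochs):
        h, y = forward(params, x)
        err = y - t
        losses.append(0.5 * float(np.sum(err ** 2)) / n)
        # Backpropagate the loss through the output and hidden layers.
        grad_y = err / n
        grad_W2 = h.T @ grad_y
        grad_b2 = grad_y.sum(axis=0)
        grad_z = (grad_y @ params["W2"].T) * (1.0 - h ** 2)
        grad_W1 = x.T @ grad_z
        grad_b1 = grad_z.sum(axis=0)
        params["W1"] -= lr * grad_W1
        params["b1"] -= lr * grad_b1
        params["W2"] -= lr * grad_W2
        params["b2"] -= lr * grad_b2
    return losses


def demo():
    """Train on synthetic stand-ins for aligned source/target LPC frames."""
    rng = np.random.default_rng(0)
    order = 12                                   # assumed LPC order
    source = rng.normal(size=(200, order))       # placeholder source-speaker frames
    mixing = rng.normal(0.0, 0.3, (order, order))
    target = np.tanh(source @ mixing)            # placeholder target-speaker frames
    params = init_params(order, 16, order, rng)
    losses = train(params, source, target)
    _, converted = forward(params, source)
    return losses, converted
```

In a real system the synthetic arrays would be replaced by LPC coefficients extracted from time-aligned utterances of the two speakers, and the contribution of a sound class could then be probed by comparing conversion error on frames restricted to that class.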
