Towards Low-Resource StarGAN Voice Conversion using Weight Adaptive Instance Normalization

Paper title

Towards Low-Resource StarGAN Voice Conversion using Weight Adaptive Instance Normalization

Authors

Mingjie Chen, Yanpei Shi, Thomas Hain

Abstract

Many-to-many voice conversion with non-parallel training data has seen significant progress in recent years, and StarGAN-based models have attracted considerable interest for this task. However, most StarGAN-based methods have been evaluated only in settings where the number of speakers is small and the amount of training data is large. In this work, we aim to improve the data efficiency of the model and to achieve many-to-many, non-parallel, StarGAN-based voice conversion for a relatively large number of speakers with limited training samples. To improve data efficiency, the proposed model uses a speaker encoder to extract speaker embeddings and applies adaptive instance normalization (AdaIN) to the convolutional weights. Experiments are conducted with 109 speakers under two low-resource conditions, in which the number of training samples is 20 and 5 per speaker. An objective evaluation shows that the proposed model outperforms the baseline methods. Furthermore, a subjective evaluation shows that the proposed model outperforms the baseline method in both naturalness and similarity.
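The abstract does not spell out how AdaIN is applied to convolutional weights, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes each output filter is normalized to zero mean and unit variance and then rescaled and shifted by values projected from a speaker embedding. The function names, the projection matrices `proj_scale` and `proj_shift`, and all shapes are hypothetical.

```python
import numpy as np


def weight_adain(weight, speaker_emb, proj_scale, proj_shift, eps=1e-5):
    """Speaker-conditioned normalization of conv weights (illustrative assumption).

    weight:      (out_ch, in_ch, kernel) convolution filters
    speaker_emb: (emb_dim,) embedding, e.g. from a speaker encoder
    proj_scale, proj_shift: (emb_dim, out_ch) hypothetical linear projections
    """
    # Normalize each output filter over its input-channel and kernel axes.
    mean = weight.mean(axis=(1, 2), keepdims=True)
    std = weight.std(axis=(1, 2), keepdims=True)
    normalized = (weight - mean) / (std + eps)
    # Derive a per-filter scale and shift from the speaker embedding.
    gamma = speaker_emb @ proj_scale
    beta = speaker_emb @ proj_shift
    return gamma[:, None, None] * normalized + beta[:, None, None]


def conv1d(x, weight):
    """Plain 'valid' 1-D cross-correlation. x: (in_ch, T) -> (out_ch, T - kernel + 1)."""
    out_ch, _, kernel = weight.shape
    steps = x.shape[1] - kernel + 1
    out = np.zeros((out_ch, steps))
    for t in range(steps):
        # Contract the filter's (in_ch, kernel) axes against the input window.
        out[:, t] = np.tensordot(weight, x[:, t:t + kernel], axes=([1, 2], [0, 1]))
    return out


# Toy usage with random data; all sizes are arbitrary.
rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 3, 5))
speaker_emb = rng.normal(size=8)
proj_scale = rng.normal(size=(8, 4))
proj_shift = rng.normal(size=(8, 4))

adapted = weight_adain(weight, speaker_emb, proj_scale, proj_shift)
features = rng.normal(size=(3, 20))   # stand-in for acoustic features
converted = conv1d(features, adapted)
```

Under this reading, the speaker identity changes the filters themselves rather than the activations, so the same shared base weights can serve many speakers, which is one way such a scheme could help when each speaker has only a few training samples.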
