Paper Title
Delivering Speaking Style in Low-resource Voice Conversion with Multi-factor Constraints
Paper Authors
Paper Abstract
Conveying the linguistic content and preserving the speaking style of the source speech, such as intonation and emotion, is essential in voice conversion (VC). However, in a low-resource setting, where only a limited number of utterances from the target speaker are available, existing VC methods struggle to meet this requirement while also capturing the target speaker's timbre. In this work, a novel VC model, referred to as MFC-StyleVC, is proposed for the low-resource VC task. Specifically, a speaker timbre constraint generated by a clustering method is proposed to guide the learning of the target speaker's timbre at different stages. Meanwhile, to prevent over-fitting to the target speaker's limited data, perceptual regularization constraints explicitly maintain model performance in specific aspects, including speaking style, linguistic content, and speech quality. In addition, a simulation mode is introduced that simulates the inference process during training, alleviating the mismatch between training and inference. Extensive experiments on highly expressive speech demonstrate the superiority of the proposed method in low-resource VC.