Paper Title
TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training
Paper Authors
Paper Abstract
Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Recently, AutoVC, a conditional-autoencoder-based method, achieved excellent conversion results by disentangling speaker identity and speech content using information-constraining bottlenecks. However, due to the pure autoencoder training method, it is difficult to evaluate how well content and speaker identity are separated. In this paper, a novel voice conversion framework, named $\boldsymbol{T}$ext $\boldsymbol{G}$uided $\boldsymbol{A}$utoVC (TGAVC), is proposed to more effectively separate content and timbre from speech, where an expected content embedding produced from the text transcriptions is designed to guide the extraction of voice content. In addition, adversarial training is applied to eliminate speaker identity information from the estimated content embedding extracted from speech. Under the guidance of the expected content embedding and the adversarial training, the content encoder is trained to extract a speaker-independent content embedding from speech. Experiments on the AIShell-3 dataset show that the proposed model outperforms AutoVC in terms of naturalness and similarity of the converted speech.
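The two training signals named in the abstract — a text-guided content loss and an adversarial speaker loss — can be sketched in plain numpy. This is an illustrative sketch, not the paper's implementation: the embedding dimension, loss weight, linear speaker classifier, and function names are all assumptions for demonstration.

```python
import numpy as np

def guidance_loss(expected, estimated):
    """MSE between the text-derived (expected) and speech-derived
    (estimated) content embeddings; pulls the content encoder toward
    the transcription-based target."""
    return float(np.mean((expected - estimated) ** 2))

def speaker_adversarial_loss(content_emb, speaker_id, classifier_w):
    """Cross-entropy of a linear speaker classifier on the content
    embedding. The classifier is trained to minimize this, while the
    content encoder is trained against it (e.g. via gradient reversal),
    pushing speaker identity out of the content embedding."""
    logits = content_emb @ classifier_w        # shape: (n_speakers,)
    logits = logits - logits.max()             # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[speaker_id]))

rng = np.random.default_rng(0)
expected = rng.normal(size=32)    # hypothetical text-encoder output
estimated = rng.normal(size=32)   # hypothetical content-encoder output
w = rng.normal(size=(32, 4))      # toy classifier for 4 speakers

# Content-encoder objective: follow the text guidance while fooling the
# speaker classifier (the 0.1 weight is an illustrative choice).
total = guidance_loss(expected, estimated) \
        - 0.1 * speaker_adversarial_loss(estimated, 2, w)
```

In a full system both encoders would be neural networks trained jointly with the autoencoder reconstruction loss; the sign flip on the adversarial term stands in for the gradient-reversal mechanism.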