Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

Paper Title

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

Authors

Wenjie Li, Benlai Tang, Xiang Yin, Yushi Zhao, Wei Li, Kang Wang, Hao Huang, Yuxuan Wang, Zejun Ma

Abstract

Accent conversion (AC) transforms a non-native speaker's accent into a native accent while preserving the speaker's voice timbre. In this paper, we propose approaches that improve both the applicability and the quality of accent conversion. First, we assume that no reference speech is available at the conversion stage, so we employ an end-to-end text-to-speech system trained on native speech to generate native reference speech. To improve the quality and accent of the converted speech, we introduce reference encoders, which allow us to exploit multi-source information. The motivation is that acoustic features extracted from the native reference, together with linguistic information, are complementary to conventional phonetic posteriorgrams (PPGs), so they can be concatenated as additional features to improve a baseline system based only on PPGs. Moreover, we optimize the model architecture by using GMM-based attention instead of windowed attention to improve synthesis performance. Experimental results indicate that, when the proposed techniques are applied, the integrated system significantly raises the scores for acoustic quality (30% relative increase in mean opinion score) and native accent (68% relative preference) while retaining the voice identity of the non-native speaker.
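
The abstract says that features from the reference encoder can be concatenated with PPGs. A minimal sketch of that step is given below. It is an illustration only and not the authors' implementation: the function name `concat_reference`, the toy dimensions, and the assumption that the reference encoder outputs one fixed-size embedding per utterance, repeated for every frame, are all hypothetical.

```python
from typing import List

Frame = List[float]


def concat_reference(ppg_frames: List[Frame], ref_embedding: Frame) -> List[Frame]:
    """Append one fixed-size reference embedding to every PPG frame.

    Assumption (not from the paper): the reference encoder yields a single
    utterance-level vector that is repeated along the time axis.
    """
    # Each output frame has len(ppg_frame) + len(ref_embedding) dimensions.
    return [frame + ref_embedding for frame in ppg_frames]


# Toy input: 3 frames of a 4-dimensional PPG and a 2-dimensional embedding.
ppg = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
ref = [0.5, -0.3]

features = concat_reference(ppg, ref)
print(len(features), len(features[0]))  # 3 frames, 6 dimensions each
```

In a real system the PPG and embedding dimensions would be far larger, and the concatenated frames would feed the synthesis model in place of the PPG-only input used by the baseline.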
