Paper Title
Pronunciation Dictionary-Free Multilingual Speech Synthesis by Combining Unsupervised and Supervised Phonetic Representations
Paper Authors
Paper Abstract
This paper proposes a multilingual speech synthesis method that combines unsupervised phonetic representations (UPRs) and supervised phonetic representations (SPRs) to avoid reliance on the pronunciation dictionaries of target languages. In this method, a pretrained wav2vec 2.0 model is adopted to extract UPRs, and a language-independent automatic speech recognition (LI-ASR) model is built with a connectionist temporal classification (CTC) loss to extract segment-level SPRs from the audio data of target languages. Then, an acoustic model is designed, which first predicts UPRs and SPRs from texts separately and then combines the predicted UPRs and SPRs to generate mel-spectrograms. The results of our experiments on six languages show that the proposed method outperformed the methods that directly predicted mel-spectrograms from character or phoneme sequences, as well as the ablated models that utilized only UPRs or only SPRs.
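The abstract does not say how frame-level outputs of the CTC-trained LI-ASR model are turned into segment-level SPRs. As a rough illustration only, the sketch below assumes one plausible scheme: consecutive frames that share the same non-blank CTC label form a segment, and the frame features inside each segment are averaged. The function name `segment_level_features`, the blank index `BLANK = 0`, and the mean-pooling step are all assumptions made for this example and are not taken from the paper.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def segment_level_features(
    frame_labels: Sequence[int],
    frame_features: Sequence[Sequence[float]],
    blank: int = BLANK,
) -> List[Tuple[int, List[float]]]:
    """Mean-pool frame features over runs of identical non-blank labels.

    This is an illustrative pooling scheme, not the paper's procedure.
    Returns one (label, mean_feature_vector) pair per segment.
    """
    if len(frame_labels) != len(frame_features):
        raise ValueError("labels and features must have the same length")

    segments: List[Tuple[int, List[float]]] = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label != blank:
            chunk = frame_features[start:start + length]
            dim = len(chunk[0])
            # Average each feature dimension over the frames of this run.
            mean = [sum(vec[d] for vec in chunk) / length for d in range(dim)]
            segments.append((label, mean))
        start += length  # blank runs are skipped but still advance the index
    return segments


if __name__ == "__main__":
    labels = [0, 3, 3, 0, 5]
    feats = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [5.0, 6.0]]
    print(segment_level_features(labels, feats))
```

Under this assumption, the number of segments follows the number of recognized units rather than the number of audio frames, which is one way a text-side predictor could be given targets of a comparable length to its input.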