Paper Title
Unsupervised Acoustic Unit Representation Learning for Voice Conversion using WaveNet Auto-encoders
Paper Authors
Abstract
Unsupervised representation learning of speech has been of keen interest in recent years, as is evident, for example, in the wide interest in the ZeroSpeech challenges. This work presents a new method for learning frame-level representations based on WaveNet auto-encoders. Of particular interest in the ZeroSpeech Challenge 2019 were models with discrete latent variables, such as the Vector Quantized Variational Auto-Encoder (VQ-VAE). However, these models generate speech of relatively poor quality. In this work we aim to address this with two approaches: first, WaveNet is used as the decoder to generate waveform data directly from the latent representation; second, the low complexity of the latent representations is improved with two alternative disentanglement learning methods, namely instance normalization and sliced vector quantization. The method was developed and tested in the context of the recent ZeroSpeech Challenge 2020. The system output submitted to the challenge obtained the top position for naturalness (Mean Opinion Score 4.06), the top position for intelligibility (Character Error Rate 0.15), and the third position for quality of the representation (ABX test score 12.5). These results and further analyses in this paper illustrate that the quality of the converted speech and of the acoustic unit representations can be well balanced.
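The sliced vector quantization mentioned in the abstract splits each latent vector into slices and quantizes each slice against its own codebook. The following is a minimal NumPy sketch of that idea only; the function name, codebook sizes, and slice count are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def sliced_vector_quantize(z, codebooks):
    """Split latent vector z into equal slices, one per codebook, and map
    each slice to its nearest codebook entry (Euclidean distance)."""
    slices = np.split(z, len(codebooks))
    quantized, indices = [], []
    for s, cb in zip(slices, codebooks):
        d = np.linalg.norm(cb - s, axis=1)  # distance of slice to each code vector
        k = int(np.argmin(d))               # index of nearest code vector
        indices.append(k)
        quantized.append(cb[k])
    # Concatenated quantized slices plus the discrete unit indices.
    return np.concatenate(quantized), indices

# Toy example (illustrative sizes): 8-dim latent, 2 slices,
# each with its own codebook of 4 code vectors.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(4, 4)) for _ in range(2)]
z = rng.normal(size=8)
zq, idx = sliced_vector_quantize(z, codebooks)
```

Compared with a single large codebook, slicing lets the discrete bottleneck represent combinations of slice-level codes, which is one way to trade representation capacity against the low-bitrate constraint of the challenge.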