Learning Disentangled Phone and Speaker Representations in a Semi-Supervised VQ-VAE Paradigm

Title

Learning Disentangled Phone and Speaker Representations in a Semi-Supervised VQ-VAE Paradigm

Authors

Jennifer Williams, Yi Zhao, Erica Cooper, Junichi Yamagishi

Abstract

We present a new approach to disentangle speaker voice and phone content by introducing new components to the VQ-VAE architecture for speech synthesis. The original VQ-VAE does not generalize well to unseen speakers or content. To alleviate this problem, we have incorporated a speaker encoder and speaker VQ codebook that learns global speaker characteristics entirely separate from the existing sub-phone codebooks. We also compare two training methods: self-supervised with global conditions and semi-supervised with speaker labels. Adding a speaker VQ component improves objective measures of speech synthesis quality (estimated MOS, speaker similarity, ASR-based intelligibility) and provides learned representations that are meaningful. Our speaker VQ codebook indices can be used in a simple speaker diarization task and perform slightly better than an x-vector baseline. Additionally, phones can be recognized from sub-phone VQ codebook indices in our semi-supervised VQ-VAE better than self-supervised with global conditions.
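The separation between per-frame sub-phone codes and a single global speaker code can be illustrated with a toy nearest-neighbour quantizer. This is only a minimal sketch under stated assumptions, not the authors' implementation: the helper names `nearest_code` and `quantize_utterance`, the toy codebooks, and the mean-pooling used to obtain one speaker vector per utterance are all hypothetical. In the actual system the inputs would be outputs of learned neural encoders and the codebooks would be trained.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2 distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def quantize_utterance(content_frames, speaker_frames, phone_codebook, speaker_codebook):
    """Hypothetical helper: per-frame sub-phone indices plus one global speaker index."""
    # Content path: every frame is quantized independently against the sub-phone codebook.
    phone_indices = [nearest_code(f, phone_codebook) for f in content_frames]

    # Speaker path (assumption): average over time to get one utterance-level vector,
    # then quantize it against a completely separate speaker codebook.
    dim = len(speaker_frames[0])
    pooled = [sum(f[d] for f in speaker_frames) / len(speaker_frames) for d in range(dim)]
    speaker_index = nearest_code(pooled, speaker_codebook)
    return phone_indices, speaker_index


# Toy 2-dimensional codebooks and frames (illustrative values only).
phone_codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
speaker_codebook = [[0.5, 0.5], [-0.5, 0.5]]
frames = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8]]

# For simplicity the same toy frames feed both paths; real encoders would differ.
phone_indices, speaker_index = quantize_utterance(
    frames, frames, phone_codebook, speaker_codebook
)
print(phone_indices, speaker_index)  # [0, 1, 1] 0
```

The discrete indices returned here are the kind of representation the abstract refers to when it mentions using speaker VQ codebook indices for diarization and sub-phone indices for phone recognition.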
