Paper Title

Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

Authors

Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi, Tomoki Toda

Abstract

We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation learned from massive unlabeled data, which is assumed to be speaker-independent and to correspond well to the underlying linguistic content. Given a training dataset of the target speaker, we extract VQW2V and acoustic features to estimate a seq2seq mapping function from the former to the latter. With the help of a pretraining method and a newly designed postprocessing technique, our model can be trained with as little as 5 min of data, even outperforming the same model trained with parallel data.
