Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

Paper Title

Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

Authors

Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi, Tomoki Toda

Abstract

We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation learned from massive unlabeled data, which is assumed to be speaker-independent and to correspond well to the underlying linguistic content. Given a training dataset of the target speaker, we extract VQW2V and acoustic features to estimate a seq2seq mapping function from the former to the latter. With the help of a pretraining method and a newly designed postprocessing technique, our model can be trained with as little as 5 min of data, even outperforming the same model trained with parallel data.

