Mid-attribute speaker generation using optimal-transport-based interpolation of Gaussian mixture models

Paper title

Mid-attribute speaker generation using optimal-transport-based interpolation of Gaussian mixture models

Authors

Watanabe, Aya; Takamichi, Shinnosuke; Saito, Yuki; Xin, Detai; Saruwatari, Hiroshi

Abstract

In this paper, we propose a method for intermediating multiple speakers' attributes and diversifying their voice characteristics in "speaker generation," an emerging task that aims to synthesize a nonexistent speaker's natural-sounding voice. The conventional TacoSpawn-based speaker generation method represents the distributions of speaker embeddings by Gaussian mixture models (GMMs) conditioned on speaker attributes. Although this method enables the sampling of various speakers from the speaker-attribute-aware GMMs, it is not yet clear whether the learned distributions can represent speakers with an intermediate attribute (i.e., mid-attribute). To this end, we propose an optimal-transport-based method that interpolates the learned GMMs to generate nonexistent speakers with mid-attribute (e.g., gender-neutral) voices. We empirically validate our method and evaluate the naturalness of synthetic speech and the controllability of two speaker attributes: gender and language fluency. The evaluation results show that our method can control the generated speakers' attributes by a continuous scalar value without statistically significant degradation of speech naturalness.
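
The abstract does not spell out the interpolation procedure, so the following is only a minimal sketch of one common way to interpolate two GMMs with optimal transport. It is not the authors' implementation. It rests on simplifying assumptions that are not taken from the paper: both GMMs have the same number of components, all components have equal weight (so the optimal transport plan between components reduces to a one-to-one matching), and covariances are diagonal (so the squared 2-Wasserstein distance between two components has a simple closed form). The names `w2_squared`, `match_components`, and `interpolate_gmm`, and the toy values in `gmm_a` and `gmm_b`, are hypothetical.

```python
from itertools import permutations


def w2_squared(mean0, std0, mean1, std1):
    """Squared 2-Wasserstein distance between two diagonal Gaussians."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mean0, mean1))
    std_term = sum((a - b) ** 2 for a, b in zip(std0, std1))
    return mean_term + std_term


def match_components(gmm0, gmm1):
    """Return the permutation p that minimizes the total transport cost
    when component i of gmm0 is matched to component p[i] of gmm1.

    Assumption: equal-weight components, so a one-to-one matching is optimal.
    Brute force over permutations; only sensible for a handful of components.
    """
    k = len(gmm0)
    return min(
        permutations(range(k)),
        key=lambda p: sum(w2_squared(*gmm0[i], *gmm1[p[i]]) for i in range(k)),
    )


def interpolate_gmm(gmm0, gmm1, t):
    """Interpolate two GMMs at a scalar t in [0, 1].

    Each GMM is a list of (mean, std) pairs with diagonal covariance.
    Matched components move along the Wasserstein geodesic: for diagonal
    Gaussians, means and standard deviations both interpolate linearly.
    """
    perm = match_components(gmm0, gmm1)
    result = []
    for i, j in enumerate(perm):
        (mean0, std0), (mean1, std1) = gmm0[i], gmm1[j]
        mean = [(1 - t) * a + t * b for a, b in zip(mean0, mean1)]
        std = [(1 - t) * a + t * b for a, b in zip(std0, std1)]
        result.append((mean, std))
    return result


# Toy 2-D example (made-up numbers): one GMM per attribute value.
gmm_a = [([0.0, 0.0], [1.0, 1.0]), ([4.0, 0.0], [0.5, 0.5])]
gmm_b = [([5.0, 1.0], [0.5, 1.0]), ([1.0, 1.0], [1.0, 2.0])]
mid_gmm = interpolate_gmm(gmm_a, gmm_b, 0.5)
```

In this sketch the scalar `t` plays the role of the continuous control value mentioned in the abstract: `t = 0` returns the first GMM, `t = 1` returns the matched components of the second, and intermediate values give a mid-attribute distribution from which a speaker embedding could then be sampled. With unequal weights or differing component counts, the matching step would have to be replaced by a general discrete optimal transport solver.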
