Title
Building Synthetic Speaker Profiles in Text-to-Speech Systems
Authors
Abstract
The diversity of speaker profiles in multi-speaker TTS systems is a crucial aspect of their performance, as it measures how many different speaker profiles a TTS system can synthesize. However, this important aspect is often overlooked when building multi-speaker TTS systems, and there is no established framework for evaluating this diversity. The reason is that most multi-speaker TTS systems are limited to generating speech signals with the same speaker profiles as their training data. They often use discrete speaker embedding vectors that have a one-to-one correspondence with individual speakers. This correspondence limits TTS systems and hinders their ability to generate unseen speaker profiles that did not appear during training. In this paper, we aim to build multi-speaker TTS systems that have a greater variety of speaker profiles and can generate new synthetic speaker profiles that differ from those in the training data. To this end, we propose using generative models with a triplet loss and a specific shuffle mechanism. In our experiments, we demonstrate the effectiveness and advantages of the proposed method in terms of both the distinctiveness and the intelligibility of the synthesized speech signals.