Paper Title


Text-free non-parallel many-to-many voice conversion using normalising flows

Paper Authors

Thomas Merritt, Abdelhamid Ezzerg, Piotr Biliński, Magdalena Proszewska, Kamil Pokora, Roberto Barra-Chicote, Daniel Korzekwa

Abstract


Non-parallel voice conversion (VC) is typically achieved using lossy representations of the source speech. However, ensuring that only speaker identity information is dropped, whilst all other information from the source speech is retained, is a significant challenge. This is particularly difficult in the scenario where at inference time we have no knowledge of the text being read, i.e., text-free VC. To mitigate this, we investigate information-preserving VC approaches. Normalising flows have gained attention for text-to-speech synthesis, but they have been under-explored for VC. Flows utilise invertible functions to learn the likelihood of the data, thus providing a lossless encoding of speech. We investigate normalising flows for VC in both text-conditioned and text-free scenarios. Furthermore, for text-free VC we compare pre-trained and jointly-learnt priors. Flow-based VC evaluations show no degradation between text-free and text-conditioned VC, resulting in improvements over the state-of-the-art. Also, joint training of the prior is found to negatively impact text-free VC quality.
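To illustrate the core idea the abstract describes, the sketch below shows a speaker-conditioned affine coupling layer: an invertible transform gives a lossless encoding of speech features, and conversion can be performed by encoding frames under the source speaker and inverting the flow under the target speaker. This is a minimal illustration, not the authors' implementation; all module names, dimensions, and the use of a single coupling layer are hypothetical simplifications.

```python
# Minimal sketch (assumed names and sizes, not the paper's architecture):
# a speaker-conditioned affine coupling flow for voice conversion.
import torch
import torch.nn as nn


class SpeakerConditionedCoupling(nn.Module):
    """One affine coupling layer conditioned on a speaker embedding."""

    def __init__(self, feat_dim: int, spk_dim: int, hidden: int = 128):
        super().__init__()
        self.half = feat_dim // 2
        # Predicts scale and shift for the second half of the features
        # from the first half plus the speaker embedding.
        self.net = nn.Sequential(
            nn.Linear(self.half + spk_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * (feat_dim - self.half)),
        )

    def forward(self, x, spk):
        xa, xb = x[:, : self.half], x[:, self.half :]
        scale, shift = self.net(torch.cat([xa, spk], dim=-1)).chunk(2, dim=-1)
        # Invertible transform: zb = xb * exp(scale) + shift.
        zb = xb * torch.exp(scale) + shift
        return torch.cat([xa, zb], dim=-1)

    def inverse(self, z, spk):
        za, zb = z[:, : self.half], z[:, self.half :]
        scale, shift = self.net(torch.cat([za, spk], dim=-1)).chunk(2, dim=-1)
        xb = (zb - shift) * torch.exp(-scale)
        return torch.cat([za, xb], dim=-1)


def convert(flow, frames, src_spk, tgt_spk):
    """Encode frames under the source speaker, decode under the target speaker."""
    z = flow(frames, src_spk)          # lossless latent: no information dropped
    return flow.inverse(z, tgt_spk)    # features re-synthesised with target identity


if __name__ == "__main__":
    feat_dim, spk_dim = 80, 16          # e.g. 80-bin mel frames; sizes are illustrative
    flow = SpeakerConditionedCoupling(feat_dim, spk_dim)
    frames = torch.randn(4, feat_dim)   # a batch of source-speech frames
    src, tgt = torch.randn(4, spk_dim), torch.randn(4, spk_dim)
    converted = convert(flow, frames, src, tgt)
    # Invertibility check: encoding then decoding with the same speaker recovers the input.
    assert torch.allclose(flow.inverse(flow(frames, src), src), frames, atol=1e-4)
    print(converted.shape)
```

In a full flow-based VC model, many such coupling layers would be stacked and the latent would be matched to a prior (pre-trained or jointly learnt, as compared in the paper); the sketch only demonstrates the invertibility argument behind the lossless-encoding claim.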
