Paper Title

Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

Authors

Antti Suni, Sofoklis Kakouros, Martti Vainio, Juraj Šimko

Abstract

Recent advances in deep learning methods have elevated synthetic speech quality to human level, and the field is now moving towards addressing prosodic variation in synthetic speech. Despite successes in this effort, state-of-the-art systems fall short of faithfully reproducing local prosodic events that give rise to, e.g., word-level emphasis and phrasal structure. This type of prosodic variation often reflects long-distance semantic relationships that are not accessible to end-to-end systems with a single sentence as their synthesis domain. One possible solution is to condition the synthesized speech on explicit prosodic labels, potentially generated using longer portions of text. In this work we evaluate whether augmenting the textual input with such prosodic labels, capturing word-level prominence and phrasal boundary strength, can result in a more accurate realization of sentence prosody. We use an automatic wavelet-based technique to extract such labels from speech material, and use them as an input to a Tacotron-like synthesis system alongside textual information. The results of objective evaluation of synthesized speech show that using the prosodic labels significantly improves the output in terms of faithfulness of F0 and energy contours, in comparison with state-of-the-art implementations.
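
To make the idea of "augmenting the textual input with prosodic labels" concrete, the following is a minimal sketch of one way it could be done. It is not the authors' implementation: the tag format (`<p0>`, `<b2>`), the function names `quantize` and `augment_text`, and the threshold values are all hypothetical. It assumes each word already has a continuous prominence score and a boundary-strength score (for example, from a wavelet-based analysis), quantizes each score into a small number of discrete levels, and attaches the levels to the word as extra tokens.

```python
from typing import Sequence, Tuple


def quantize(value: float, thresholds: Tuple[float, ...]) -> int:
    """Map a continuous score to a discrete level.

    The level is the number of thresholds the value meets or exceeds,
    so two thresholds yield the levels 0, 1, and 2.
    """
    return sum(value >= t for t in thresholds)


def augment_text(
    words: Sequence[str],
    prominence: Sequence[float],
    boundary: Sequence[float],
    thresholds: Tuple[float, ...] = (0.5, 1.5),  # hypothetical cut-offs
) -> str:
    """Attach discrete prominence (<pN>) and boundary (<bN>) tags to each word.

    The tag syntax is an assumption made for illustration only.
    """
    if not (len(words) == len(prominence) == len(boundary)):
        raise ValueError("words, prominence, and boundary must have equal length")
    tokens = []
    for word, prom, bnd in zip(words, prominence, boundary):
        p_level = quantize(prom, thresholds)
        b_level = quantize(bnd, thresholds)
        tokens.append(f"{word}<p{p_level}><b{b_level}>")
    return " ".join(tokens)


if __name__ == "__main__":
    labeled = augment_text(
        ["the", "cat", "sat"],
        prominence=[0.1, 1.8, 0.7],
        boundary=[0.0, 0.2, 2.0],
    )
    print(labeled)  # the<p0><b0> cat<p2><b0> sat<p1><b2>
```

A string labeled this way could then be tokenized and fed to a sequence-to-sequence model in place of plain text, so that the network is able to learn an association between the tags and local F0 and energy patterns.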
