Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion

Paper title

Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion

Authors

Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

Abstract

We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis methods with filled pause (FP) insertion. Spontaneous speech synthesis is aimed at producing speech with human-like disfluencies, such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-inserted synthetic speech is often limited. To address this issue, we present a method for synthesizing spontaneous speech that improves robustness to diverse FP insertions. Regularization is used to stabilize the synthesis of the linguistic speech (i.e., non-FP) elements. To further improve robustness to diverse FP insertions, it utilizes pseudo-FPs sampled using an FP word prediction model as well as ground-truth FPs. Our experiments demonstrated that the proposed method improves the naturalness of synthetic speech with ground-truth and predicted FPs by 0.24 and 0.26, respectively.
