Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

Paper title

Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

Authors

Detai Xin, Shinnosuke Takamichi, Takuma Okamoto, Hisashi Kawai, Hiroshi Saruwatari

Abstract

This paper presents a speaking-rate-controllable HiFi-GAN neural vocoder. Original HiFi-GAN is a high-fidelity, computationally efficient, and tiny-footprint neural vocoder. We attempt to incorporate a speaking rate control function into HiFi-GAN for improving the accessibility of synthetic speech. The proposed method inserts a differentiable interpolation layer into the HiFi-GAN architecture. A signal resampling method and an image scaling method are implemented in the proposed method to warp the mel-spectrograms or hidden features of the neural vocoder. We also design and open-source a Japanese speech corpus containing three kinds of speaking rates to evaluate the proposed speaking rate control method. Experimental results of comprehensive objective and subjective evaluations demonstrate that 1) the proposed method outperforms a baseline time-scale modification algorithm in speech naturalness, 2) warping mel-spectrograms by image scaling obtained the best performance among all proposed methods, and 3) the proposed speaking rate control method can be incorporated into HiFi-GAN without losing computational efficiency.
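As a rough illustration of what warping a mel-spectrogram along the time axis means, the sketch below stretches or compresses a sequence of frames by linear interpolation. It is not the authors' implementation: the abstract describes a differentiable interpolation layer inside HiFi-GAN and does not specify the interpolation kernel, so plain linear interpolation is used here as a simple stand-in. The function name `warp_time_axis` and the convention that `rate > 1` means faster speech (fewer output frames) are assumptions made for this example.

```python
def warp_time_axis(mel, rate):
    """Linearly interpolate a mel-spectrogram along its time axis.

    mel:  list of frames, each a list of mel-bin values (T x n_mels).
    rate: speaking-rate factor. Assumed convention (not from the paper):
          rate > 1 speeds speech up, so the output has fewer frames.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    n_in = len(mel)
    if n_in == 0:
        return []
    n_out = max(1, round(n_in / rate))
    if n_out == 1 or n_in == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(mel[0]) for _ in range(n_out)]
    out = []
    for i in range(n_out):
        # Map output frame i to a fractional input position,
        # keeping the first and last frames aligned.
        pos = i * (n_in - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(mel[lo], mel[hi])])
    return out


# Toy 5-frame, 2-bin "spectrogram" used only for demonstration.
mel = [[0.0, 10.0], [1.0, 11.0], [2.0, 12.0], [3.0, 13.0], [4.0, 14.0]]
slow = warp_time_axis(mel, 0.5)   # half speed: twice as many frames
same = warp_time_axis(mel, 1.0)   # unchanged rate: same frames
```

In this sketch the warped frames would then be passed to the vocoder in place of the original ones. A trainable version would use a differentiable interpolation operator from a deep learning framework instead of Python lists.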
