Mel-spectrogram augmentation for sequence to sequence voice conversion

Paper title

Mel-spectrogram augmentation for sequence to sequence voice conversion

Authors

Hwang, Yeongtae; Cho, Hyemin; Yang, Hongsun; Won, Dong-Ok; Oh, Insoo; Lee, Seong-Whan

Abstract

To train a sequence-to-sequence voice conversion model, we need to address the shortage of paired speech data, that is, pairs of recordings of the same utterance. This study experimentally investigated the effects of Mel-spectrogram augmentation on training a sequence-to-sequence voice conversion (VC) model from scratch. For Mel-spectrogram augmentation, we adopted the policies proposed in SpecAugment. In addition, we proposed new policies (i.e., frequency warping, loudness control, and time length control) for more data variation. Moreover, to find appropriate hyperparameters for the augmentation policies without training the VC model, we proposed a hyperparameter search strategy and a new metric for reducing experimental cost, namely deformation per deteriorating ratio. We compared the effects of these Mel-spectrogram augmentation methods across training sets of various sizes and across augmentation policies. In the experimental results, the policies based on time-axis warping (i.e., time length control and time warping) showed better performance than the other policies. These results indicate that Mel-spectrogram augmentation is beneficial for training the VC model.
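To make the kinds of policy named in the abstract more concrete, here is a minimal NumPy sketch of three of them: SpecAugment-style frequency and time masking, a loudness shift, and a time length change. It is not the authors' implementation. All function names, the choice of the spectrogram minimum as the mask value, the assumption that the spectrogram is on a log (dB) scale, and the use of linear interpolation for time length control are assumptions made for illustration. Time warping and frequency warping are not covered.

```python
import numpy as np


def freq_mask(mel, max_width, rng):
    """Mask a random band of mel channels (SpecAugment-style frequency masking).

    mel has shape (n_mels, n_frames). The mask value is the spectrogram
    minimum, which is an assumption of this sketch.
    """
    out = mel.copy()
    n_mels = out.shape[0]
    width = int(rng.integers(0, min(max_width, n_mels) + 1))
    start = int(rng.integers(0, n_mels - width + 1))
    out[start:start + width, :] = out.min()
    return out


def time_mask(mel, max_width, rng):
    """Mask a random run of consecutive frames (SpecAugment-style time masking)."""
    out = mel.copy()
    n_frames = out.shape[1]
    width = int(rng.integers(0, min(max_width, n_frames) + 1))
    start = int(rng.integers(0, n_frames - width + 1))
    out[:, start:start + width] = out.min()
    return out


def loudness_control(mel_db, gain_db):
    """Shift a log-scale (dB) mel-spectrogram by a constant gain.

    Assumes the input is already in dB, so a gain is a simple addition.
    """
    return mel_db + gain_db


def time_length_control(mel, rate):
    """Uniformly stretch or compress the time axis by linear interpolation.

    rate > 1 shortens the utterance, rate < 1 lengthens it.
    """
    n_mels, n_frames = mel.shape
    new_frames = max(1, int(round(n_frames / rate)))
    src_positions = np.linspace(0.0, n_frames - 1, new_frames)
    frame_index = np.arange(n_frames)
    return np.stack([np.interp(src_positions, frame_index, row) for row in mel])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mel = rng.normal(size=(80, 100))  # stand-in for a real log-mel spectrogram
    augmented = time_length_control(time_mask(freq_mask(mel, 10, rng), 20, rng), 1.25)
    print(mel.shape, augmented.shape)
```

In a sequence-to-sequence VC setting, source and target utterances generally have different lengths, so a time-axis policy like `time_length_control` can be applied to either side without breaking a frame-level alignment. Whether the paper applies it this way is not stated in the abstract.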
