Paper Title
Generating Diverse Vocal Bursts with StyleGAN2 and MEL-Spectrograms
Paper Authors
Abstract
We describe our approach for the generative emotional vocal burst task (ExVo Generate) of the ICML Expressive Vocalizations Competition. We train a conditional StyleGAN2 architecture on mel-spectrograms of preprocessed versions of the audio samples. The mel-spectrograms generated by the model are then inverted back to the audio domain. As a result, our generated samples substantially improve upon the baseline provided by the competition from a qualitative and quantitative perspective for all emotions. More precisely, even for our worst-performing emotion (awe), we obtain a Fréchet Audio Distance (FAD) of 1.76 compared to the baseline of 4.81 (as a reference, the FAD between the train/validation sets for awe is 0.776).
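The FAD figures quoted above are Fréchet distances between Gaussians fitted to embeddings of two audio sets, so lower values mean the generated samples are statistically closer to the reference set. The following is a minimal sketch of that distance computation only, not the competition's evaluation code: the function name `frechet_distance` is hypothetical, and the random arrays stand in for the embeddings that a pretrained audio model would normally supply.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Frechet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (num_samples, embedding_dim).
    Hypothetical helper for illustration; real FAD would use embeddings
    extracted from audio by a pretrained model.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(emb_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(emb_b, rowvar=False))

    # tr(sqrt(cov_a @ cov_b)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (clip guards against round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Stand-in data (assumption): random vectors instead of real audio embeddings.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))
shifted = reference + 3.0  # same covariance, mean moved by 3 in each dimension

same_set_distance = frechet_distance(reference, reference)
shifted_distance = frechet_distance(reference, shifted)
```

Comparing a set with itself gives a distance of essentially zero, while shifting every embedding by a constant changes only the mean term. This mirrors how the train/validation FAD in the abstract serves as a reference floor for the generated-sample scores.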