Conditional variational autoencoder to improve neural audio synthesis for polyphonic music sound

Paper Title

Conditional variational autoencoder to improve neural audio synthesis for polyphonic music sound

Authors

Seokjin Lee, Minhan Kim, Seunghyeon Shin, Daeho Lee, Inseon Jang, Wootaek Lim

Abstract

Deep generative models for audio synthesis have improved significantly in recent years. However, modeling raw waveforms remains a difficult problem, especially for audio waveforms and music signals. Recently, the realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis. RAVE is based on the variational autoencoder and uses a two-stage training strategy. Unfortunately, the RAVE model is limited in its ability to reproduce wide-pitch polyphonic music sound. Therefore, to enhance reconstruction performance, we supply pitch activation data to the RAVE model as auxiliary information. To handle the auxiliary information, we propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully connected layer. To evaluate the proposed structure, we conducted a listening experiment based on the multiple stimuli with hidden reference and anchor (MUSHRA) test, using the MAESTRO dataset. The results indicate that the proposed model exhibits a significant improvement in performance and stability over the conventional RAVE model.
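
The conditioning idea in the abstract can be illustrated with a small sketch. The abstract does not say where the pitch activation enters the network, so everything below is an assumption made for illustration: the 88-entry pitch vector, the latent and embedding sizes, the tanh nonlinearity, and the helper names `init_linear`, `linear`, `reparameterize`, and `condition_latent` are all hypothetical and are not taken from the paper or from the RAVE code base. The sketch only shows one common way to condition a variational autoencoder: embed the auxiliary vector with a fully connected layer and concatenate it to the sampled latent before decoding.

```python
import math
import random


def init_linear(n_in, n_out, rng):
    """Create weights and bias for a fully connected layer (hypothetical helper)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, weights, bias):
    """Apply a fully connected layer: y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def reparameterize(mean, log_var, rng):
    """Sample z = mean + sigma * eps with eps ~ N(0, 1), the usual VAE sampling step."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]


def condition_latent(z, pitch_activation, fc):
    """Embed the pitch activation with a fully connected layer and append it to z.

    Assumption: conditioning by concatenation; the paper may inject it differently.
    """
    weights, bias = fc
    embedding = [math.tanh(v) for v in linear(pitch_activation, weights, bias)]
    return z + embedding


# Assumed sizes, chosen only for the example.
N_PITCHES = 88   # one entry per piano key
LATENT_DIM = 16
EMBED_DIM = 8

rng = random.Random(0)
fc = init_linear(N_PITCHES, EMBED_DIM, rng)

# A pitch activation frame with three simultaneously active keys (polyphony).
pitch = [0.0] * N_PITCHES
for key in (39, 43, 46):
    pitch[key] = 1.0

# Stand-in encoder outputs; a real encoder would compute these from audio.
mean = [0.0] * LATENT_DIM
log_var = [0.0] * LATENT_DIM

z = reparameterize(mean, log_var, rng)
decoder_input = condition_latent(z, pitch, fc)
print(len(decoder_input))  # 24 = LATENT_DIM + EMBED_DIM
```

In this sketch the decoder would receive a 24-dimensional vector whose first 16 entries are the sampled latent and whose last 8 entries encode which pitches are active, so the decoder no longer has to infer the active notes from the latent alone.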
