FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge

Paper Title

FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge

Authors

Bichen Wu, Qing He, Peizhao Zhang, Thilo Koehler, Kurt Keutzer, Peter Vajda

Abstract

Nowadays, more and more applications can benefit from edge-based text-to-speech (TTS). However, most existing TTS models are too computationally expensive and are not flexible enough to be deployed across the wide variety of edge devices, whose computational capacities are equally diverse. To address this, we propose FBWave, a family of efficient and scalable neural vocoders that can achieve optimal performance-efficiency trade-offs for different edge devices. FBWave is a hybrid flow-based generative model that combines the advantages of autoregressive and non-autoregressive models. It produces high-quality audio and supports streaming during inference while remaining highly computationally efficient. Our experiments show that FBWave can achieve audio quality similar to WaveRNN while reducing MACs by 40x. More efficient variants of FBWave can achieve up to 109x fewer MACs while still delivering acceptable audio quality. Audio demos are available at https://bichenwu09.github.io/vocoder_demos.

Paper Title