Cross-Scale Vector Quantization for Scalable Neural Speech Coding

Paper title

Cross-Scale Vector Quantization for Scalable Neural Speech Coding

Authors

Jiang, Xue; Peng, Xiulian; Xue, Huaying; Zhang, Yuan; Lu, Yan

Abstract

Bitrate scalability is a desirable feature for audio coding in real-time communications. Existing neural audio codecs usually enforce a specific bitrate during training, so a separate model must be trained for each target bitrate. This increases the memory footprint at both the sender and the receiver, and transcoding is often needed to support multiple receivers. In this paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ), in which multi-scale features are encoded progressively with stepwise feature fusion and refinement. In this way, a coarse-level signal is reconstructed if only a portion of the bitstream is received, and the quality improves progressively as more bits become available. The proposed CSVQ scheme can be flexibly applied to any neural audio coding network with a mirrored auto-encoder structure to achieve bitrate scalability. Subjective results show that the proposed scheme outperforms the classical residual VQ (RVQ) with scalability. Moreover, the proposed CSVQ at 3 kbps outperforms Opus at 9 kbps and Lyra at 3 kbps, and it provides a graceful quality boost as the bitrate increases.
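To make the idea of a scalable bitstream concrete, the sketch below illustrates the classical residual VQ (RVQ) baseline that the abstract compares against, not the CSVQ scheme itself, whose fusion and refinement details are not given here. Everything in it is an assumption for illustration: the codebooks are random rather than learned, a zero codeword is added to each stage so that an extra stage can never increase the error, and the frame rate, codebook size, and all function names are hypothetical.

```python
import math
import numpy as np


def make_codebooks(num_stages, codebook_size, dim, seed=0):
    """Build toy codebooks (random, NOT learned) with shrinking scale per stage."""
    rng = np.random.default_rng(seed)
    codebooks = []
    for stage in range(num_stages):
        cb = rng.normal(scale=0.5 ** stage, size=(codebook_size, dim))
        cb[0] = 0.0  # assumption: zero codeword, so a stage never hurts
        codebooks.append(cb)
    return codebooks


def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage codes the previous residual."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the codewords for however many stage indices were received."""
    x_hat = np.zeros(codebooks[0].shape[1])
    for idx, cb in zip(indices, codebooks):
        x_hat = x_hat + cb[idx]
    return x_hat


def bitrate_kbps(num_stages, codebook_size, frames_per_second):
    """Bitrate when each stage sends one index per frame."""
    return num_stages * math.log2(codebook_size) * frames_per_second / 1000.0


codebooks = make_codebooks(num_stages=4, codebook_size=256, dim=8)
x = np.random.default_rng(1).normal(size=8)
indices = rvq_encode(x, codebooks)

# Decode from a growing prefix of the bitstream: 0, 1, ..., 4 stages.
errors = [
    float(np.linalg.norm(x - rvq_decode(indices[:k], codebooks)))
    for k in range(len(indices) + 1)
]
for k, err in enumerate(errors):
    print(f"stages received: {k}, reconstruction error: {err:.4f}")
```

Decoding only a prefix of the indices yields a coarser reconstruction, and each additional stage can only keep or reduce the error in this toy setup. With the hypothetical settings of 256-entry codebooks and 50 frames per second, each stage adds 0.4 kbps, so the receiver's bitrate is set by how many stages arrive.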
