Paper Title
Disentangled Feature Learning for Real-Time Neural Speech Coding
Paper Authors
Paper Abstract
Recently, end-to-end neural audio/speech coding has shown great potential to outperform traditional signal-analysis-based audio codecs. This is mostly achieved by following the VQ-VAE paradigm, in which blind features are learned, vector-quantized, and coded. In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding. Specifically, global-like speaker identity features and local content features are learned with disentanglement to represent speech. Such a compact feature decomposition not only achieves better coding efficiency by exploiting bit allocation among the different features, but also provides the flexibility to perform audio editing in the embedding space, such as voice conversion in real-time communications. Both subjective and objective results demonstrate its coding efficiency, and we find that the learned disentangled features achieve performance on any-to-any voice conversion comparable to modern self-supervised speech representation learning models, with far fewer parameters and lower latency, showing the potential of our neural coding framework.
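To make the coding scheme in the abstract concrete, below is a minimal, hypothetical sketch of the disentangled quantization idea: speech is represented by one global "speaker" vector per utterance plus per-frame "content" vectors, each quantized against its own codebook so that bits can be allocated separately to the two streams. The feature dimensions, codebook sizes, and nearest-neighbour quantizer are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def vector_quantize(x, codebook):
    """Map each row of x to the index of its nearest codeword (L2 distance)."""
    # x: (N, D), codebook: (K, D) -> indices: (N,)
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)

# Toy "encoder outputs": one utterance of 100 frames with 64-dim features.
frames = 100
content_feat = rng.normal(size=(frames, 64))              # local, per-frame content features
speaker_feat = content_feat.mean(axis=0, keepdims=True)   # global, utterance-level feature (placeholder)

# Separate codebooks allow per-stream bit allocation:
# e.g. 8 bits once per utterance for the speaker stream, 10 bits per frame for content.
speaker_codebook = rng.normal(size=(2 ** 8, 64))
content_codebook = rng.normal(size=(2 ** 10, 64))

speaker_idx = vector_quantize(speaker_feat, speaker_codebook)   # 1 index per utterance
content_idx = vector_quantize(content_feat, content_codebook)   # 1 index per frame

bits = 8 * speaker_idx.size + 10 * content_idx.size
print(f"speaker indices: {speaker_idx.size}, content indices: {content_idx.size}, "
      f"total payload ~ {bits} bits")

# "Voice conversion in the embedding space" then amounts to replacing the speaker
# index with one extracted from a different utterance before decoding.
```

In this toy setup, the slowly varying speaker stream is coded once per utterance while the content stream is coded per frame, which is the bit-allocation flexibility the abstract refers to; the actual model learns both streams end-to-end with a disentanglement objective.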