Paper Title

Speech Enhancement Using Self-Supervised Pre-Trained Model and Vector Quantization

Authors

Xiao-Ying Zhao, Qiu-Shi Zhu, Jie Zhang

Abstract

With the development of deep learning, neural-network-based speech enhancement (SE) models have shown excellent performance. Meanwhile, self-supervised pre-trained models have proven useful for a wide range of downstream tasks. In this paper, we consider applying a pre-trained model to the real-time SE problem. Specifically, the encoder and bottleneck layer of the DEMUCS model are initialized with the self-supervised pre-trained WavLM model, the convolutions in the encoder are replaced by causal convolutions, and the transformer encoder in the bottleneck layer uses a causal attention mask. In addition, since discretizing the noisy speech representations is beneficial for denoising, we utilize a quantization module to discretize the representations output by the bottleneck layer, which are then fed into the decoder to reconstruct the clean speech waveform. Experimental results on the Valentini dataset and an internal dataset show that the pre-trained-model-based initialization improves SE performance, and that the discretization operation suppresses the noise component in the representations to some extent, which further improves performance.
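
The abstract highlights two mechanisms that are easy to illustrate in isolation: enforcing causality in the transformer bottleneck with an attention mask, and discretizing the bottleneck output with a vector quantizer before it reaches the decoder. The PyTorch sketch below is a minimal, hypothetical illustration of these two ideas, not the authors' implementation; the `VectorQuantizer` module, the codebook size, the 768-dimensional features (matching WavLM's hidden size), and the omission of codebook/commitment losses are all assumptions made for brevity.

```python
# Minimal, hypothetical sketch (not the paper's code) of two ideas from the
# abstract: (1) a causal attention mask for the transformer bottleneck, and
# (2) vector quantization of the bottleneck output with a straight-through
# estimator so gradients still reach the encoder.
import torch
import torch.nn as nn


def causal_attention_mask(seq_len: int) -> torch.Tensor:
    # Boolean mask where True marks the (future) positions a frame must NOT
    # attend to, the convention expected by nn.TransformerEncoderLayer.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)


class VectorQuantizer(nn.Module):
    """Snap each bottleneck frame to its nearest codebook entry (assumed sizes)."""

    def __init__(self, num_codes: int = 320, dim: int = 768):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, time, dim) continuous bottleneck representations.
        # Squared Euclidean distance to every code: (batch, time, num_codes).
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2.0 * z @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        indices = dist.argmin(dim=-1)          # nearest code index per frame
        z_q = self.codebook(indices)           # discrete (quantized) features
        # Straight-through estimator: forward uses z_q, backward copies
        # gradients to z so the encoder keeps receiving a learning signal.
        return z + (z_q - z).detach()


if __name__ == "__main__":
    frames = torch.randn(2, 100, 768)                     # dummy bottleneck input
    mask = causal_attention_mask(frames.size(1))          # forbid looking ahead
    layer = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
    quantized = VectorQuantizer()(layer(frames, src_mask=mask))
    print(quantized.shape)                                # torch.Size([2, 100, 768])
```

In a full system along the lines described in the abstract, the quantized features would be passed to the DEMUCS-style decoder to reconstruct the clean waveform; the sketch stops at the quantized representation.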
