Paper Title

Speech Denoising with Auditory Models

Authors

Saddler, Mark R., Francl, Andrew, Feather, Jenelle, Qian, Kaizhi, Zhang, Yang, McDermott, Josh H.

Abstract

Contemporary speech enhancement predominantly relies on audio transforms that are trained to reconstruct a clean speech waveform. The development of high-performing neural network sound recognition systems has raised the possibility of using deep feature representations as 'perceptual' losses with which to train denoising systems. We explored their utility by first training deep neural networks to classify either spoken words or environmental sounds from audio. We then trained an audio transform to map noisy speech to an audio waveform that minimized the difference in the deep feature representations between the output audio and the corresponding clean audio. The resulting transforms removed noise substantially better than baseline methods trained to reconstruct clean waveforms, and also outperformed previous methods using deep feature losses. However, a similar benefit was obtained simply by using losses derived from the filter bank inputs to the deep networks. The results show that deep features can guide speech enhancement, but suggest that they do not yet outperform simple alternatives that do not involve learned features.
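To make the deep feature loss described in the abstract concrete, below is a minimal PyTorch-style sketch. It assumes a frozen, pretrained recognition network whose intermediate activations are compared between the denoised output and the clean target; the names `DeepFeatureLoss`, `recognizer`, and `layer_names` are illustrative assumptions, not the authors' published code.

```python
import torch
import torch.nn as nn


class DeepFeatureLoss(nn.Module):
    """L1 distance between intermediate activations of a frozen recognizer.

    Sketch of a 'perceptual' loss: the recognizer's weights stay fixed, and
    gradients flow only through the denoised waveform fed into it.
    """

    def __init__(self, recognizer, layer_names):
        super().__init__()
        self.recognizer = recognizer.eval()
        for p in self.recognizer.parameters():
            p.requires_grad_(False)  # loss network is not trained
        self.layer_names = layer_names
        self._features = {}
        # Forward hooks record the activations of the requested layers.
        for name, module in self.recognizer.named_modules():
            if name in layer_names:
                module.register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(module, inputs, output):
            self._features[name] = output
        return hook

    def _extract(self, waveform):
        self._features = {}
        self.recognizer(waveform)
        return {name: self._features[name] for name in self.layer_names}

    def forward(self, denoised_waveform, clean_waveform):
        denoised_feats = self._extract(denoised_waveform)
        with torch.no_grad():  # clean targets need no gradients
            clean_feats = self._extract(clean_waveform)
        return sum(
            torch.mean(torch.abs(denoised_feats[k] - clean_feats[k]))
            for k in self.layer_names
        )


if __name__ == "__main__":
    # Toy stand-in for a pretrained word/sound recognition network.
    toy_recognizer = nn.Sequential(
        nn.Conv1d(1, 8, kernel_size=9, padding=4), nn.ReLU(),
        nn.Conv1d(8, 8, kernel_size=9, padding=4),
    )
    loss_fn = DeepFeatureLoss(toy_recognizer, layer_names=["0", "2"])
    clean = torch.randn(2, 1, 16000)            # (batch, channel, samples)
    denoised = clean + 0.1 * torch.randn_like(clean)
    print(loss_fn(denoised, clean))
```

In training, this loss would replace (or supplement) a waveform reconstruction loss: the denoising transform's output and the clean reference are both passed through the frozen recognizer, and the transform is updated to match the deep features rather than the raw samples.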
