Statistical and Neural Network Based Speech Activity Detection in Non-Stationary Acoustic Environments

Paper title

Statistical and Neural Network Based Speech Activity Detection in Non-Stationary Acoustic Environments

Authors

Jens Heitkaemper, Joerg Schmalenstroeer, Reinhold Haeb-Umbach

Abstract

Speech activity detection (SAD) often rests on the assumption that noise is more stationary than speech. It is therefore particularly challenging in non-stationary environments, where the time variance of the acoustic scene makes it difficult to discriminate speech from noise. We propose two approaches to SAD: one is based on statistical signal processing, while the other uses neural networks. The former employs sophisticated signal processing to track the noise and speech energies and is meant to support the case for a resource-efficient, unsupervised signal processing approach. The latter introduces a recurrent network layer that operates on short segments of the input speech to perform temporal smoothing in the presence of non-stationary noise. The systems are tested on the Fearless Steps Challenge, which consists of transmission data from the Apollo-11 space mission. The statistical SAD achieves detection performance comparable to earlier proposed neural network based SADs, while the neural network based approach achieves a decision cost function of 1.07% on the evaluation set of the 2020 Fearless Steps Challenge, which sets a new state of the art.
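The decision cost function quoted in the abstract is a weighted sum of the miss rate and the false-alarm rate. The following is a minimal sketch of how such a score could be computed from frame-level labels. It is not the challenge's official scoring tool: the function name `decision_cost`, the frame-level (rather than time-based) counting, and the default weights of 0.75 for misses and 0.25 for false alarms are assumptions made here for illustration, and any collar handling around speech boundaries is omitted.

```python
def decision_cost(reference, hypothesis, w_fn=0.75, w_fp=0.25):
    """Frame-level decision cost: w_fn * P_FN + w_fp * P_FP.

    reference, hypothesis: equal-length sequences of truthy/falsy values,
    where a truthy value marks a speech frame.
    The default weights are an assumption, not taken from the abstract.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    n_speech = sum(1 for r in reference if r)
    n_nonspeech = len(reference) - n_speech

    # Missed speech: reference says speech, hypothesis says non-speech.
    misses = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    # False alarms: reference says non-speech, hypothesis says speech.
    false_alarms = sum(1 for r, h in zip(reference, hypothesis) if not r and h)

    # Guard against empty classes so the rates stay well defined.
    p_fn = misses / n_speech if n_speech else 0.0
    p_fp = false_alarms / n_nonspeech if n_nonspeech else 0.0
    return w_fn * p_fn + w_fp * p_fp


if __name__ == "__main__":
    ref = [1, 1, 1, 1, 0, 0, 0, 0]
    hyp = [1, 1, 1, 0, 1, 0, 0, 0]
    print(f"DCF = {100 * decision_cost(ref, hyp):.2f}%")
```

Weighting misses more heavily than false alarms reflects the view that discarding speech is costlier than passing some noise on to later processing. A system evaluated this way would typically tune its decision threshold on a development set to minimize the cost.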
