Prediction of speech intelligibility with DNN-based performance measures

Title

Prediction of speech intelligibility with DNN-based performance measures

Authors

Martinez, Angel Mario Castro; Spille, Constantin; Roßbach, Jana; Kollmeier, Birger; Meyer, Bernd T.

Abstract

This paper presents a speech intelligibility model based on automatic speech recognition (ASR). It combines phoneme probabilities from deep neural networks (DNNs) with a performance measure that estimates the word error rate from these probabilities. The model requires neither a clean speech reference nor word labels during testing, because the ASR decoding step, which finds the most likely sequence of words given the phoneme posterior probabilities, is omitted. The model is evaluated via the root-mean-squared error between the predicted and observed speech reception thresholds from eight normal-hearing listeners. The recognition task consists of identifying noisy words from a German matrix sentence test. The speech material was mixed with eight noise maskers covering different modulation types, from speech-shaped stationary noise to a single-talker masker. The prediction performance is compared to five established models and to an ASR model that uses word labels. Two combinations of features and networks were tested. Both include temporal information, either at the feature level (amplitude modulation filterbanks and a feed-forward network) or captured by the architecture (mel-spectrograms and a time-delay deep neural network, TDNN). The TDNN model is on par with the DNN while reducing the number of parameters by a factor of 37; this optimization allows parallel streams on dedicated hearing aid hardware, as a forward pass can be computed within the 10 ms of each frame. The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.
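
The evaluation metric named in the abstract, the root-mean-squared error between predicted and observed speech reception thresholds (SRTs), can be sketched as follows. This is only an illustration, assuming one predicted and one observed SRT per noise condition, both in dB SNR. The function name `srt_rmse` and the numbers are placeholders and are not taken from the paper.

```python
import math


def srt_rmse(predicted_db, observed_db):
    """Root-mean-squared error between predicted and observed SRTs (dB SNR)."""
    if len(predicted_db) != len(observed_db) or not predicted_db:
        raise ValueError("need two non-empty sequences of equal length")
    # Squared prediction error for each noise condition.
    squared = [(p - o) ** 2 for p, o in zip(predicted_db, observed_db)]
    return math.sqrt(sum(squared) / len(squared))


# Placeholder SRTs for four hypothetical maskers; NOT data from the paper.
predicted = [-7.0, -9.0, -12.0, -20.0]
observed = [-8.0, -9.0, -14.0, -18.0]
print(srt_rmse(predicted, observed))  # prints 1.5
```

A lower value means that the predicted thresholds lie closer to the listeners' measured thresholds, which is the sense in which the abstract compares the proposed model with the baseline models.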
