Paper Title
On Front-end Gain Invariant Modeling for Wake Word Spotting
Paper Authors
Paper Abstract
Wake word (WW) spotting is challenging in far-field conditions because of the complexity and variation of acoustic conditions and the environmental interference in signal transmission. A suite of carefully designed and optimized audio front-end (AFE) algorithms helps mitigate these challenges and provides better-quality audio signals to downstream modules such as the WW spotter. Since the WW model is trained on AFE-processed audio data, its performance is sensitive to AFE variations, such as gain changes. In addition, when deploying to new devices, WW performance is not guaranteed because the AFE is unknown to the WW model. To address these issues, we propose a novel approach that uses a new feature called $Δ$LFBE to decouple AFE gain variations from the WW model. We modify the neural network architectures to accommodate the delta computation, leaving the feature extraction module unchanged. We evaluate our WW models using data collected from real household settings and show that the models with $Δ$LFBE are robust to AFE gain changes. Specifically, when the AFE gain changes by up to $\pm$12 dB, the baseline CNN model loses up to a relative 19.0% in false alarm rate or 34.3% in false reject rate, while the model with $Δ$LFBE shows no performance loss.
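The abstract does not define $Δ$LFBE precisely, so the sketch below only illustrates why a delta over log filter bank energies could cancel a constant front-end gain. It assumes that LFBE is the natural log of per-band energies and that the delta is a first-order difference between consecutive frames. Both are assumptions, as are the helper names `log_energies`, `delta`, and `apply_gain_db`, and the data is synthetic. It is not the authors' implementation.

```python
import numpy as np

def log_energies(band_energies):
    """Log filter bank energies; input shape is (frames, bands), all values > 0."""
    return np.log(band_energies)

def delta(features):
    """First-order difference between consecutive frames (assumed delta definition)."""
    return features[1:] - features[:-1]

def apply_gain_db(band_energies, gain_db):
    """Scale energies the way a constant amplitude gain of gain_db would.

    An amplitude gain of G dB multiplies energy by 10 ** (G / 10).
    """
    return band_energies * 10.0 ** (gain_db / 10.0)

# Synthetic stand-in for filter bank energies: 50 frames, 64 bands.
rng = np.random.default_rng(0)
energies = rng.uniform(0.1, 10.0, size=(50, 64))

lfbe_ref = log_energies(energies)
lfbe_loud = log_energies(apply_gain_db(energies, 12.0))

# Plain LFBE shifts by a constant offset when the gain changes ...
lfbe_shift = np.max(np.abs(lfbe_loud - lfbe_ref))
# ... while the frame-to-frame delta cancels that offset.
delta_shift = np.max(np.abs(delta(lfbe_loud) - delta(lfbe_ref)))

print(f"max LFBE change:  {lfbe_shift:.3f}")
print(f"max delta change: {delta_shift:.2e}")
```

Under these assumptions, a constant gain adds the same offset to every log energy, so the difference between neighboring frames is unchanged up to floating-point error. A gain that varies over time would not cancel this way.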