Robust Multi-channel Speech Recognition using Frequency Aligned Network

Paper Title

Robust Multi-channel Speech Recognition using Frequency Aligned Network

Authors

Taejin Park, Kenichi Kumatani, Minhua Wu, Shiva Sundaram

Abstract

Conventional speech enhancement techniques such as beamforming have known benefits for far-field speech recognition. Our own work in frequency-domain multi-channel acoustic modeling has shown additional improvements by training a spatial filtering layer jointly within an acoustic model. In this paper, we further develop this idea and use a frequency aligned network for robust multi-channel automatic speech recognition (ASR). Unlike an affine layer in the frequency domain, the proposed frequency aligned component prevents one frequency bin from influencing other frequency bins. We show that this modification not only reduces the number of parameters in the model but also significantly improves ASR performance. We investigate the effects of the frequency aligned network through ASR experiments on real-world far-field data in which users interact with an ASR system in uncontrolled acoustic environments. Our multi-channel acoustic model with a frequency aligned network achieves up to an 18% relative reduction in word error rate.
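
The abstract does not define the frequency aligned component in detail, so the following is only a minimal sketch of the general idea it describes, under the assumption that each frequency bin gets its own small weight matrix over the microphone channels (a block-diagonal structure), whereas a full affine layer connects every bin to every other bin. The function names, the dimensions, and the omission of bias terms are all illustrative choices, not the authors' implementation.

```python
import random


def affine_param_count(num_bins, num_channels, num_outputs_per_bin):
    """Weight count of a full affine layer that mixes all bins and channels (bias omitted)."""
    in_dim = num_bins * num_channels
    out_dim = num_bins * num_outputs_per_bin
    return in_dim * out_dim


def frequency_aligned_param_count(num_bins, num_channels, num_outputs_per_bin):
    """Weight count when each bin has its own (outputs x channels) matrix (assumed structure)."""
    return num_bins * num_outputs_per_bin * num_channels


def frequency_aligned_forward(x, weights):
    """Apply per-bin weights: x[f][c] and weights[f][d][c] are complex; returns y[f][d]."""
    return [
        [sum(w_dc * x_c for w_dc, x_c in zip(w_d, x_f)) for w_d in w_f]
        for x_f, w_f in zip(x, weights)
    ]


def bins_changed_by_perturbation(num_bins=4, num_channels=2, num_outputs_per_bin=3, seed=0):
    """Perturb the input at bin 1 only and report which output bins change."""
    rng = random.Random(seed)

    def rand_c():
        return complex(rng.uniform(-1, 1), rng.uniform(-1, 1))

    x = [[rand_c() for _ in range(num_channels)] for _ in range(num_bins)]
    w = [
        [[rand_c() for _ in range(num_channels)] for _ in range(num_outputs_per_bin)]
        for _ in range(num_bins)
    ]
    y = frequency_aligned_forward(x, w)

    x_perturbed = [row[:] for row in x]
    x_perturbed[1][0] += 1.0  # change one channel of bin 1 only
    y_perturbed = frequency_aligned_forward(x_perturbed, w)

    return [f for f in range(num_bins) if y[f] != y_perturbed[f]]


if __name__ == "__main__":
    print("affine weights:", affine_param_count(4, 2, 3))
    print("frequency aligned weights:", frequency_aligned_param_count(4, 2, 3))
    print("output bins affected by perturbing bin 1:", bins_changed_by_perturbation())
```

In this toy setting the per-bin structure needs fewer weights than the full affine layer by a factor equal to the number of frequency bins, and a change at one input bin reaches only the matching output bin, which is consistent with the two properties the abstract claims.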
