Paper Title
Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition
Paper Authors
Abstract
Acoustic models in real-time speech recognition systems typically stack multiple unidirectional LSTM layers to process the acoustic frames over time. Performance improvements over vanilla LSTM architectures have been reported by prepending a stack of frequency-LSTM (FLSTM) layers to the time LSTM. These FLSTM layers can learn a more robust input feature for the time LSTM layers by modeling time-frequency correlations in the acoustic input signals. A drawback of FLSTM-based architectures, however, is that they operate at a predefined, tuned window size and stride, referred to as a 'view' in this paper. We present a simple and efficient modification that combines the outputs of multiple FLSTM stacks with different views into a dimensionality-reduced feature representation. The proposed multi-view FLSTM architecture allows modeling a wider range of time-frequency correlations than an FLSTM model with a single view. When trained on 50K hours of English far-field speech data with CTC loss followed by sMBR sequence training, the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% across different speaker and acoustic environment scenarios over an optimized single-view FLSTM model, while retaining a similar computational footprint.
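The paper does not include code, but the core idea of a 'view' — slicing each frame's frequency axis into overlapping chunks with a given window size and stride, with one FLSTM stack per view — can be illustrated with a minimal NumPy sketch. All names, the feature dimensions, and the three view settings below are illustrative assumptions, not the paper's actual configuration; the FLSTM stacks themselves and the dimensionality-reducing projection are omitted.

```python
import numpy as np

def frequency_windows(frames, window, stride):
    """Slice the frequency axis of each frame into overlapping chunks.

    frames: (T, F) array, e.g. T frames of F log-mel filterbank bins.
    Returns a (T, N, window) array, where N = (F - window) // stride + 1
    is the number of chunks a frequency-LSTM would step over per frame.
    """
    _, F = frames.shape
    n = (F - window) // stride + 1
    # Gather indices: row i selects bins [i*stride, i*stride + window).
    idx = stride * np.arange(n)[:, None] + np.arange(window)[None, :]
    return frames[:, idx]

# Hypothetical multi-view setup: each (window, stride) pair would feed
# its own FLSTM stack; the stacks' outputs would then be concatenated
# and linearly projected before the time-LSTM layers.
frames = np.random.randn(10, 64)            # 10 frames, 64 mel bins
views = [(16, 8), (24, 12), (32, 16)]       # illustrative view settings
chunks = [frequency_windows(frames, w, s) for w, s in views]
for (w, s), c in zip(views, chunks):
    print(f"view (window={w}, stride={s}): chunk tensor {c.shape}")
```

A single-view model fixes one (window, stride) pair and must be tuned to it; the multi-view combination lets the downstream layers draw on several frequency resolutions at once.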