Paper Title
Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition
Paper Authors
Abstract
Acoustic models in real-time speech recognition systems typically stack multiple unidirectional LSTM layers to process the acoustic frames over time. Performance improvements over vanilla LSTM architectures have been reported by prepending a stack of frequency-LSTM (FLSTM) layers to the time LSTM. These FLSTM layers can learn a more robust input feature for the time LSTM layers by modeling time-frequency correlations in the acoustic input signals. A drawback of FLSTM-based architectures, however, is that they operate at a predefined, tuned window size and stride, referred to as a 'view' in this paper. We present a simple and efficient modification that combines the outputs of multiple FLSTM stacks with different views into a dimensionality-reduced feature representation. The proposed multi-view FLSTM architecture allows modeling a wider range of time-frequency correlations than an FLSTM model with a single view. When trained on 50K hours of English far-field speech data with CTC loss followed by sMBR sequence training, the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% across different speaker and acoustic environment scenarios over an optimized single-view FLSTM model, while retaining a similar computational footprint.
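The paper does not include code, but the core idea of a 'view' — slicing each frame's frequency axis into overlapping chunks with a given window size and stride, with one FLSTM stack per view — can be illustrated with a minimal NumPy sketch. All names, the feature dimensions, and the three view settings below are illustrative assumptions, not the paper's actual configuration; the FLSTM stacks themselves and the dimensionality-reducing projection are omitted.

```python
import numpy as np

def frequency_windows(frames, window, stride):
    """Slice the frequency axis of each frame into overlapping chunks.

    frames: (T, F) array, e.g. T frames of F log-mel filterbank bins.
    Returns a (T, N, window) array, where N = (F - window) // stride + 1
    is the number of chunks a frequency-LSTM would step over per frame.
    """
    _, F = frames.shape
    n = (F - window) // stride + 1
    # Gather indices: row i selects bins [i*stride, i*stride + window).
    idx = stride * np.arange(n)[:, None] + np.arange(window)[None, :]
    return frames[:, idx]

# Hypothetical multi-view setup: each (window, stride) pair would feed
# its own FLSTM stack; the stacks' outputs would then be concatenated
# and linearly projected before the time-LSTM layers.
frames = np.random.randn(10, 64)            # 10 frames, 64 mel bins
views = [(16, 8), (24, 12), (32, 16)]       # illustrative view settings
chunks = [frequency_windows(frames, w, s) for w, s in views]
for (w, s), c in zip(views, chunks):
    print(f"view (window={w}, stride={s}): chunk tensor {c.shape}")
```

A single-view model fixes one (window, stride) pair and must be tuned to it; the multi-view combination lets the downstream layers draw on several frequency resolutions at once.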