Paper Title

Simplified Self-Attention for Transformer-based End-to-End Speech Recognition

Authors

Haoneng Luo, Shiliang Zhang, Ming Lei, Lei Xie

Abstract

Transformer models have been introduced into end-to-end speech recognition with state-of-the-art performance on various tasks owing to their superiority in modeling long-term dependencies. However, such improvements are usually obtained through the use of very large neural networks. Transformer models mainly include two submodules: position-wise feed-forward layers and self-attention (SAN) layers. In this paper, to reduce the model complexity while maintaining good performance, we propose a simplified self-attention (SSAN) layer which employs an FSMN memory block instead of projection layers to form query and key vectors for transformer-based end-to-end speech recognition. We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1 task and on internal 1000-hour and 20,000-hour large-scale Mandarin tasks. Results show that our proposed SSAN-based transformer model can achieve over 20% relative reduction in model parameters and a 6.7% relative CER reduction on the AISHELL-1 task. With an impressive 20% parameter reduction, our model shows no loss of recognition performance on the 20,000-hour large-scale task.
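
The abstract describes one key architectural change: queries and keys in the self-attention layer are produced by an FSMN memory block (a learnable weighted sum over a local temporal context) rather than by separate projection matrices. Below is a minimal PyTorch sketch of such an SSAN layer, written from the abstract alone and not the authors' implementation; the memory-block context sizes, the sharing of a single memory output for both queries and keys, and keeping a linear projection for the values are all assumptions for illustration.

```python
# Minimal sketch of a "simplified self-attention" (SSAN) layer as described in the
# abstract: Q and K come from an FSMN-style memory block instead of projection
# layers. Context sizes and the handling of V are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FSMNMemoryBlock(nn.Module):
    """FSMN-style memory block: a per-dimension weighted sum over a local
    temporal context (depthwise 1-D convolution over time) plus a residual."""

    def __init__(self, d_model: int, left_context: int = 10, right_context: int = 10):
        super().__init__()
        kernel = left_context + right_context + 1  # symmetric context assumed
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=kernel,
                              padding=kernel // 2, groups=d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return x + y  # residual memory output


class SimplifiedSelfAttention(nn.Module):
    """Multi-head self-attention where Q and K share one FSMN memory output
    (no W_q / W_k matrices); V keeps a linear projection (an assumption)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.memory = FSMNMemoryBlock(d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        qk = self.memory(x)            # queries and keys share the memory output
        v = self.w_v(x)

        def split(z):                  # (b, t, d) -> (b, heads, t, d_head)
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(qk), split(qk), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        ctx = F.softmax(scores, dim=-1) @ v
        ctx = ctx.transpose(1, 2).contiguous().view(b, t, d)
        return self.w_out(ctx)


if __name__ == "__main__":
    layer = SimplifiedSelfAttention(d_model=256, n_heads=4)
    feats = torch.randn(2, 100, 256)   # (batch, frames, features)
    print(layer(feats).shape)          # torch.Size([2, 100, 256])
```

Under this reading, the parameter saving reported in the abstract would come from replacing the d_model x d_model query and key projection matrices with a single depthwise filter of size d_model x kernel.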
