Paper Title
Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model
Paper Authors
Paper Abstract
Despite the rapid progress of end-to-end (E2E) automatic speech recognition (ASR), it has been shown that incorporating external language models (LMs) into decoding can further improve the recognition performance of E2E ASR systems. To align with the modeling units adopted in E2E ASR systems, subword-level (e.g., character, BPE) LMs are usually used to cooperate with current E2E ASR systems. However, using subword-level LMs ignores word-level information, which may limit the strength of external LMs in E2E ASR. Although several methods have been proposed to incorporate word-level external LMs into E2E ASR, these methods are mainly designed for languages with clear word boundaries, such as English, and cannot be directly applied to languages like Mandarin, in which each character sequence can have multiple corresponding word sequences. To this end, we propose a novel decoding algorithm in which a word-level lattice is constructed on-the-fly to consider all possible word sequences for each partial hypothesis. The LM score of the hypothesis is then obtained by intersecting the generated lattice with an external word N-gram LM. The proposed method is examined on both Attention-based Encoder-Decoder (AED) and Neural Transducer (NT) frameworks. Experiments show that our method consistently outperforms subword-level LMs, including N-gram LMs and neural network LMs. We achieve state-of-the-art results on both the Aishell-1 (CER 4.18%) and Aishell-2 (CER 5.06%) datasets and obtain a 14.8% relative CER reduction on a 21K-hour Mandarin dataset.
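The abstract describes scoring each hypothesis by enumerating all word segmentations of its character sequence in an on-the-fly word lattice and intersecting that lattice with a word N-gram LM. The sketch below is a minimal, simplified illustration of this idea, not the authors' implementation: it scores all lexicon-consistent segmentations of a complete Mandarin character string with a toy word bigram LM via dynamic programming (the paper instead builds and intersects the lattice incrementally during beam search). The lexicon, bigram table, back-off floor, and all function names are illustrative assumptions.

```python
# Sketch: word-level N-gram scoring of a Mandarin character sequence by
# searching over all segmentations allowed by a lexicon (a tiny "word lattice").
import math
from functools import lru_cache

# Hypothetical toy lexicon of Mandarin words (assumption, not from the paper).
LEXICON = {"今天", "天气", "今", "天", "气", "天天"}

# Hypothetical word bigram log-probabilities; unseen bigrams fall back to a floor.
BIGRAM_LOGP = {
    ("<s>", "今天"): math.log(0.4),
    ("今天", "天气"): math.log(0.5),
    ("<s>", "今"): math.log(0.1),
    ("今", "天"): math.log(0.2),
    ("天", "气"): math.log(0.2),
}
FLOOR_LOGP = math.log(1e-4)  # crude back-off score (assumption)


def bigram_score(prev_word: str, word: str) -> float:
    return BIGRAM_LOGP.get((prev_word, word), FLOOR_LOGP)


def best_lm_score(chars: str) -> float:
    """Best word-level LM log-score over all segmentations of `chars`.

    This dynamic program plays the role of intersecting the word lattice
    (all lexicon-consistent segmentations) with a word N-gram LM.
    """

    @lru_cache(maxsize=None)
    def dp(pos: int, prev_word: str) -> float:
        if pos == len(chars):
            return 0.0
        best = float("-inf")
        # Try every lexicon word that matches the characters starting at `pos`.
        for end in range(pos + 1, len(chars) + 1):
            word = chars[pos:end]
            if word in LEXICON:
                best = max(best, bigram_score(prev_word, word) + dp(end, word))
        return best

    return dp(0, "<s>")


if __name__ == "__main__":
    # "今天天气" can be segmented as 今天/天气, 今/天/天/气, 今天/天/气, ...
    # The returned value is the log-score of the best segmentation.
    print(best_lm_score("今天天气"))
```

In the actual decoding algorithm this score would be computed incrementally for each partial hypothesis as new characters are emitted, so that the word-level LM can shape the beam search rather than only rescore final outputs.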