Paper Title
LSTM Acoustic Models Learn to Align and Pronounce with Graphemes
Paper Authors
Paper Abstract
Automated speech recognition coverage of the world's languages continues to expand. However, standard phoneme-based systems require handcrafted lexicons that are difficult and expensive to obtain. To address this problem, we propose a training methodology for a grapheme-based speech recognizer that can be trained in a purely data-driven fashion. Built with LSTM networks and trained with the cross-entropy loss, the grapheme-output acoustic models we study are also extremely practical for real-world applications, as they can be decoded with conventional ASR stack components such as language models and FST decoders, and produce good-quality audio-to-grapheme alignments that are useful in many speech applications. We show that the grapheme models are competitive in WER with their phoneme-output counterparts when trained on large datasets, with the advantage that grapheme models do not require explicit linguistic knowledge as an input. We further compare the alignments generated by the phoneme and grapheme models to demonstrate the quality of the pronunciations they learn, using four Indian languages that vary linguistically in spoken and written forms.
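The abstract describes LSTM acoustic models that emit grapheme posteriors and are trained with a cross-entropy loss against aligned targets. The sketch below illustrates that general setup in PyTorch; it is not the authors' implementation. The layer count, hidden size, feature dimensionality, grapheme inventory size, and the assumption of per-frame grapheme targets (e.g., from a forced alignment) are all placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

class GraphemeAcousticModel(nn.Module):
    """Minimal sketch: LSTM stack mapping acoustic frames to grapheme logits.

    All hyperparameters here are illustrative defaults, not the paper's.
    """

    def __init__(self, num_mel_bins=80, hidden_size=512,
                 num_layers=5, num_graphemes=50):
        super().__init__()
        self.lstm = nn.LSTM(num_mel_bins, hidden_size,
                            num_layers, batch_first=True)
        self.output = nn.Linear(hidden_size, num_graphemes)

    def forward(self, features):
        # features: (batch, frames, num_mel_bins) log-mel filterbanks
        hidden, _ = self.lstm(features)
        # per-frame grapheme logits: (batch, frames, num_graphemes)
        return self.output(hidden)

# Frame-level cross-entropy training step, assuming each acoustic frame
# has been assigned a grapheme label (hypothetical toy data below).
model = GraphemeAcousticModel()
loss_fn = nn.CrossEntropyLoss()
features = torch.randn(4, 100, 80)          # toy batch of 100-frame utterances
targets = torch.randint(0, 50, (4, 100))    # per-frame grapheme labels
logits = model(features)
loss = loss_fn(logits.reshape(-1, 50), targets.reshape(-1))
loss.backward()
```

Because the model outputs a per-frame grapheme distribution rather than a sequence-level hypothesis, its posteriors can be fed to conventional ASR stack components (an FST decoder combined with a language model), which is the practical advantage the abstract highlights.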