Paper Title
Improved Robustness to Disfluencies in RNN-Transducer Based Speech Recognition
Paper Authors
Paper Abstract
Automatic Speech Recognition (ASR) based on Recurrent Neural Network Transducers (RNN-T) is gaining interest in the speech community. We investigate data selection and preparation choices aimed at improving the robustness of RNN-T ASR to speech disfluencies, with a focus on partial words. For evaluation, we use clean data, data with disfluencies, and a separate dataset with speech affected by stuttering. We show that including a small amount of data with disfluencies in the training set improves recognition accuracy on the test sets with disfluencies and stuttering. Increasing the amount of training data with disfluencies gives additional gains without degradation on the clean data. We also show that replacing partial words with a dedicated token yields even better accuracy on utterances with disfluencies and stuttering. The evaluation of our best model shows 22.5% and 16.4% relative word error rate (WER) reduction on those two evaluation sets.
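The two operations the abstract names, replacing partial words with a dedicated token and reporting a relative WER reduction, can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the convention that a partial word is marked by a trailing hyphen (as in "wa-"), the token name `<partial>`, and the helper names `replace_partial_words` and `relative_wer_reduction`. The WER values in the usage lines are made up and are not the paper's results.

```python
import re

# Hypothetical token name; the abstract only says "a dedicated token".
PARTIAL_TOKEN = "<partial>"

# Assumption: partial words are annotated with a trailing hyphen ("wa-").
# Hyphenated compounds such as "well-known" do not match this pattern.
_PARTIAL_RE = re.compile(r"^\w+-$")


def replace_partial_words(transcript: str, token: str = PARTIAL_TOKEN) -> str:
    """Replace every whitespace-separated partial word with `token`."""
    words = transcript.split()
    return " ".join(token if _PARTIAL_RE.match(w) else w for w in words)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Computed as 100 * (baseline - new) / baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Usage with an invented transcript and invented WER values.
print(replace_partial_words("i wa- want to g- go"))
print(relative_wer_reduction(10.0, 8.0))
```

Mapping all partial words to one token shrinks the set of distinct targets the model has to learn for fragments, which is one plausible reading of why the abstract reports a gain from this choice.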