Paper Title
Multilingual Zero Resource Speech Recognition Based on Self-Supervised Pre-Trained Acoustic Models
Authors
Abstract
Labeled audio data is insufficient to build satisfactory speech recognition systems for most of the world's languages. Some zero-resource methods attempt phoneme- or word-level speech recognition without labeled audio data in the target language, but their error rates are usually too high for real-world applications. Recently, the representations of self-supervised pre-trained models have been found to be highly beneficial for zero-resource phoneme recognition. To the best of our knowledge, this paper is the first attempt to extend the use of pre-trained models to word-level zero-resource speech recognition. This is done by fine-tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra text. Experiments with wav2vec 2.0 and HuBERT models show that this method can achieve a word error rate below 20% on some languages, with an average error rate of 33.77% across 8 languages.