Paper Title
Improving Low-Resource Speech Recognition with Pretrained Speech Models: Continued Pretraining vs. Semi-Supervised Training
Paper Authors
Paper Abstract
Self-supervised Transformer-based models, such as wav2vec 2.0 and HuBERT, have produced significant improvements over existing approaches to automatic speech recognition (ASR). This is evident in the performance of the wav2vec 2.0 based pretrained XLSR-53 model across many languages when fine-tuned with available labeled data. However, the performance obtained by fine-tuning these models can depend on the amount of in-language or similar-to-in-language data included in the pretraining dataset. In this paper we investigate continued pretraining (CoPT) with unlabeled in-language audio data on the XLSR-53 pretrained model in several low-resource languages. CoPT is more computationally efficient than semi-supervised training (SST), the standard approach to utilizing unlabeled data in ASR, since it avoids the need to pseudo-label the unlabeled data. We show that CoPT results in word error rates (WERs) equal to or slightly better than those obtained with SST. In addition, we show that using the CoPT model for pseudo-labeling, and using these labels in SST, yields further improvements in WER.
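The pseudo-labeling step mentioned at the end of the abstract can be illustrated with a minimal sketch: a CoPT'd and fine-tuned wav2vec 2.0 checkpoint transcribes unlabeled audio, and those transcripts then serve as pseudo-labels for SST. This is not the authors' released code; the checkpoint path "copt-xlsr53-finetuned" and the use of the Hugging Face Transformers CTC interface are assumptions made for illustration.

```python
# Minimal pseudo-labeling sketch, assuming a CoPT'd and fine-tuned checkpoint
# exported in Hugging Face format at the hypothetical path "copt-xlsr53-finetuned".
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

MODEL_PATH = "copt-xlsr53-finetuned"  # hypothetical checkpoint path
processor = Wav2Vec2Processor.from_pretrained(MODEL_PATH)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_PATH)
model.eval()

def pseudo_label(wav_path: str) -> str:
    """Transcribe one unlabeled utterance; the resulting text can be used
    as a pseudo-label for semi-supervised training (SST)."""
    audio, sr = sf.read(wav_path)
    assert sr == 16_000, "XLSR-53 expects 16 kHz audio"
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)          # greedy CTC decoding
    return processor.batch_decode(ids)[0]
```

In the pipeline the paper describes, these pseudo-labeled utterances would be pooled with the available labeled data and the model fine-tuned again, which is what yields the additional WER improvements reported.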