Title
Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR
Authors
Abstract
Training state-of-the-art Automatic Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learned speech and text representations in a massively multilingual, zero-supervised-speech, real-world setting to expand the set of languages covered by ASR using only unlabeled speech and text in the target languages. Using the FLEURS dataset, we define the task to cover $102$ languages, where transcribed speech is available in $52$ of these languages and can be used to improve end-to-end ASR quality on the remaining $50$. First, we show that by combining speech representations with byte-level text representations and the use of language embeddings, we can dramatically reduce the Character Error Rate (CER) on languages with no supervised speech from 64.8\% to 30.8\%, a relative reduction of 53\%. Second, using a subset of South Asian languages, we show that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is limited to no graphemic overlap. Overall, Maestro-U closes the gap to oracle performance by 68.5\% relative and reduces the CER of 19 languages below 15\%.
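The relative CER reduction quoted in the abstract can be sanity-checked with a short calculation. This is a minimal illustrative sketch (the helper name `relative_reduction` is ours, not from the paper); it shows how a drop from 64.8% to 30.8% CER corresponds to the reported roughly 53% relative reduction.

```python
def relative_reduction(baseline_cer: float, improved_cer: float) -> float:
    """Relative error-rate reduction, in percent: how much of the
    baseline CER was eliminated by the improved system."""
    return (baseline_cer - improved_cer) / baseline_cer * 100.0

# Abstract figures: CER drops from 64.8% to 30.8% on zero-supervision languages.
rel = relative_reduction(64.8, 30.8)
print(f"{rel:.1f}% relative CER reduction")  # ~52.5%, i.e. the ~53% reported
```

The same formula applies to the "closes the gap to oracle by 68.5\% relative" claim, where the baseline is the gap between the unsupervised system's CER and the oracle's CER.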