Paper Title
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
Paper Authors
Paper Abstract
Data augmentation is one of the most effective ways to make end-to-end automatic speech recognition (ASR) perform close to the conventional hybrid approach, especially when dealing with low-resource tasks. Using recent advances in speech synthesis (text-to-speech, or TTS), we build our TTS system on an ASR training database and then extend the data with synthesized speech to train the recognition model. We argue that, when the amount of training data is relatively low, this approach allows an end-to-end model to reach the quality of hybrid systems. For an artificial low-to-medium-resource setup, we compare the proposed augmentation with a semi-supervised learning technique. We also investigate the influence of vocoder choice on final ASR performance by comparing the Griffin-Lim algorithm with our modified LPCNet. When applied with an external language model, our approach outperforms a semi-supervised setup on LibriSpeech test-clean and is only 33% worse than a comparable supervised setup. Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
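As a rough illustration of the augmentation pipeline the abstract describes, here is a minimal Python sketch: synthetic utterances are generated from extra transcripts and pooled with the real training set before ASR training. The `tts_model.predict_mel` interface, the `(waveform, transcript)` dataset layout, and all parameter values are hypothetical placeholders (the paper does not specify an API); only the Griffin-Lim inversion uses librosa's actual `mel_to_audio` helper, standing in for the vocoder-free baseline that the paper compares against its modified LPCNet.

```python
# Minimal sketch of TTS-based ASR data augmentation, assuming a trained
# TTS model that predicts mel spectrograms (in dB) from text. This is an
# illustration of the general technique, not the paper's implementation.
import numpy as np
import librosa


def griffin_lim_vocoder(mel_db: np.ndarray, sr: int = 16000,
                        n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Invert a mel spectrogram to a waveform via the Griffin-Lim algorithm."""
    mel = librosa.db_to_power(mel_db)  # dB -> linear power mel spectrogram
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length, n_iter=60)


def augment_with_tts(real_dataset, extra_transcripts, tts_model, sr=16000):
    """Extend a list of (waveform, transcript) pairs with synthetic speech.

    `tts_model.predict_mel(text)` is a hypothetical interface returning a
    mel spectrogram in dB for the given transcript.
    """
    synthetic = []
    for text in extra_transcripts:
        mel_db = tts_model.predict_mel(text)  # hypothetical TTS call
        synthetic.append((griffin_lim_vocoder(mel_db, sr=sr), text))
    # The recognition model is then trained on the union of real and
    # synthesized utterances.
    return real_dataset + synthetic
```

A neural vocoder such as LPCNet would replace `griffin_lim_vocoder` in this sketch; the abstract's comparison suggests the vocoder choice measurably affects the final ASR quality.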