Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition

Paper title

Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition

Authors

Matsuura, Kohei, Mimura, Masato, Sakai, Shinsuke, Kawahara, Tatsuya

Abstract

It is important to transcribe and archive speech data of endangered languages to preserve the heritage of verbal culture, and automatic speech recognition (ASR) is a powerful tool to facilitate this process. However, since endangered languages do not generally have large corpora with many speakers, the performance of ASR models trained on them is generally poor. Nevertheless, we are often left with a lot of recordings of spontaneous speech data that have to be transcribed. In this work, to mitigate this speaker sparsity problem, we propose to convert the whole training speech data and make it sound like the test speaker in order to develop a highly accurate ASR system for this speaker. For this purpose, we utilize a CycleGAN-based non-parallel voice conversion technology to forge labeled training data that is close to the test speaker's speech. We evaluated this speaker adaptation approach on two low-resource corpora, namely Ainu and Mboshi. We obtained a 35-60% relative improvement in phone error rate on the Ainu corpus, and a 40% relative improvement on the Mboshi corpus. This approach outperformed two conventional methods, namely unsupervised adaptation and multilingual training, with these two corpora.
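
The abstract does not describe the conversion model in detail, so the following is only a minimal sketch of the cycle-consistency term that CycleGAN-style non-parallel conversion generally relies on, not the authors' implementation. The function name `cycle_consistency_loss`, the toy "generators" (simple constant shifts), and the 40-dimensional random feature frames are all assumptions made for illustration. The point it shows is that the two batches need not be paired or even the same size: each batch is only compared with its own round-trip reconstruction.

```python
import numpy as np


def cycle_consistency_loss(x_src, y_tgt, g_src2tgt, g_tgt2src):
    """L1 cycle-consistency loss over two unpaired batches of feature frames.

    x_src: frames from training (source) speakers, shape (n, d)
    y_tgt: frames from the test (target) speaker, shape (m, d)
    g_src2tgt, g_tgt2src: callables mapping frames between the two domains
    """
    # Source -> target -> source should reconstruct the source frames.
    x_cycled = g_tgt2src(g_src2tgt(x_src))
    # Target -> source -> target should reconstruct the target frames.
    y_cycled = g_src2tgt(g_tgt2src(y_tgt))
    return float(np.mean(np.abs(x_cycled - x_src)) + np.mean(np.abs(y_cycled - y_tgt)))


# Hypothetical toy data: random 40-dimensional frames, NOT real speech features.
rng = np.random.default_rng(0)
x_src = rng.normal(size=(8, 40))
y_tgt = rng.normal(size=(5, 40))  # different batch size: the data are non-parallel


# Hypothetical stand-in generators; real ones would be neural networks.
def shift_up(frames):
    return frames + 1.0


def shift_down(frames):
    return frames - 1.0


def shift_down_half(frames):
    return frames - 0.5


# Exact inverses give a loss of (numerically) zero.
perfect = cycle_consistency_loss(x_src, y_tgt, shift_up, shift_down)
# A mismatched inverse leaves a residual of 0.5 per element in each direction.
imperfect = cycle_consistency_loss(x_src, y_tgt, shift_up, shift_down_half)
```

In a full CycleGAN setup this term is combined with adversarial losses from discriminators on each domain; those are omitted here to keep the sketch short.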
