MLS: A Large-Scale Multilingual Dataset for Speech Research

Paper title

MLS: A Large-Scale Multilingual Dataset for Speech Research

Authors

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert

Abstract

This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for the other languages. Additionally, we provide language models (LMs) and baseline automatic speech recognition (ASR) models for all the languages in the dataset. We believe such a large transcribed dataset will open new avenues in ASR and text-to-speech (TTS) research. The dataset will be made freely available to anyone at http://www.openslr.org.
