Paper Title
VoxLingua107: a Dataset for Spoken Language Recognition
Paper Authors
Paper Abstract
This paper investigates the use of automatically collected web audio data for the task of spoken language recognition. We generate semi-random search phrases from language-specific Wikipedia data that are then used to retrieve videos from YouTube for 107 languages. Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech. Post-filtering is used to remove segments from the database that are likely not in the given language, increasing the proportion of correctly labeled segments to 98%, based on crowd-sourced verification. The size of the resulting training set (VoxLingua107) is 6628 hours (62 hours per language on average), and it is accompanied by an evaluation set of 1609 verified utterances. We use the data to build language recognition models for several spoken language identification tasks. Experiments show that using the automatically retrieved training data gives results competitive with those obtained using hand-labeled proprietary datasets. The dataset is publicly available.
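The abstract only outlines the data-collection pipeline at a high level. As a rough illustration of its first step, the sketch below shows one way semi-random search phrases could be drawn from language-specific Wikipedia text; the function name, the n-gram length, and the whitespace-based tokenization are illustrative assumptions, not the procedure actually used for VoxLingua107.

```python
import random
import re

def sample_search_phrases(wiki_text, num_phrases=100, phrase_len=3, seed=0):
    """Draw semi-random word n-grams from language-specific Wikipedia text.

    Hypothetical sketch: the actual phrase-generation procedure used for
    VoxLingua107 is described in the paper and is not reproduced here.
    """
    rng = random.Random(seed)
    # Crude, language-agnostic tokenization; real preprocessing would need
    # language-aware handling (e.g., for scripts without word spacing).
    words = re.findall(r"\w+", wiki_text, flags=re.UNICODE)
    phrases = []
    for _ in range(num_phrases):
        start = rng.randrange(max(1, len(words) - phrase_len + 1))
        phrases.append(" ".join(words[start:start + phrase_len]))
    return phrases

# Each phrase would then be submitted as a YouTube search query to retrieve
# candidate videos in the target language; the retrieved audio would go
# through speech activity detection, diarization, and post-filtering.
```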