An open-source voice type classifier for child-centered daylong recordings

Paper title

An open-source voice type classifier for child-centered daylong recordings

Paper authors

Marvin Lavechin, Ruben Bousbib, Hervé Bredin, Emmanuel Dupoux, Alejandrina Cristia

Abstract

Spontaneous conversations in real-world settings, such as those found in child-centered recordings, have been shown to be among the most challenging audio files to process. Nevertheless, building speech processing models that handle such a wide variety of conditions would be particularly useful for language acquisition studies, in which researchers are interested in the quantity and quality of the speech that children hear and produce, as well as for early diagnosis and for measuring the effects of remediation. In this paper, we present our approach to designing an open-source neural network that classifies audio segments into vocalizations produced by the child wearing the recording device, vocalizations produced by other children, adult male speech, and adult female speech. To this end, we gathered diverse child-centered corpora that sum to a total of 260 hours of recordings and cover 10 languages. Our model can be used as input for downstream tasks such as estimating the number of words produced by adult speakers or the number of linguistic units produced by children. Our architecture combines SincNet filters with a stack of recurrent layers and outperforms by a large margin the state-of-the-art system, the Language ENvironment Analysis (LENA) system, which has been used in numerous child language studies.
