An Investigation on Applying Acoustic Feature Conversion to ASR of Adult and Child Speech

Paper Title

An Investigation on Applying Acoustic Feature Conversion to ASR of Adult and Child Speech

Authors

Liu, Wei; Li, Jingyu; Lee, Tan

Abstract

Child speech recognition generally performs less satisfactorily than adult speech recognition because of the limited amount of training data. Significant performance degradation is expected when an automatic speech recognition (ASR) system trained on adult speech is applied directly to child speech, as a result of domain mismatch. The present study focuses on adult-to-child acoustic feature conversion to alleviate this mismatch. Different acoustic feature conversion approaches, both deep neural network based and signal processing based, are investigated and compared under a fair experimental setting, in which converted acoustic features from the same amount of labeled adult speech are used to train the ASR models from scratch. Experimental results reveal that not all of the conversion methods lead to ASR performance gains. Specifically, statistic matching, a classic unsupervised domain adaptation method, is not shown to be effective. A disentanglement-based auto-encoder (DAE) conversion framework is found to be useful, and the F0 normalization approach achieves the best performance. It is noted that the F0 distribution of the converted features is an important attribute that reflects conversion quality, whereas using an adult-child deep classification model to judge conversion quality is shown to be inappropriate.
