Paper Title
Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition
Paper Authors
Abstract
Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems designed for normal speech. Their practical application to atypical task domains, such as elderly and disordered speech across languages, is often limited by the difficulty of collecting such specialist data from target speakers. This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training, before cross-domain and cross-lingual adaptation to three datasets across two languages: the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora, and the English TORGO dysarthric speech data, to produce UTI-based articulatory features. Experiments conducted on the three tasks suggested that incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems constructed using acoustic features only, with statistically significant word or character error rate reductions of up to 4.75%, 2.59% and 2.07% absolute (14.69%, 10.64% and 22.72% relative) after data augmentation, speaker adaptation and cross-system multi-pass decoding were applied.