Paper Title

Ultra2Speech -- A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images

Paper Authors

Pramit Saha, Yadong Liu, Bryan Gick, Sidney Fels

Paper Abstract

Thousands of individuals need surgical removal of their larynx due to critical diseases every year and therefore require an alternative form of communication to articulate speech sounds after the loss of their voice box. This work addresses the articulatory-to-acoustic mapping problem based on ultrasound (US) tongue images for the development of a silent-speech interface (SSI) that can assist them in their daily interactions. Our approach automatically extracts tongue movement information by selecting an optimal feature set from US images and mapping these features to the acoustic space. We use a novel deep learning architecture to map US tongue images, captured by a US probe placed beneath a subject's chin, to formants; we call this the Ultrasound2Formant (U2F) Net. It uses hybrid spatio-temporal 3D convolutions followed by feature shuffling to estimate and track vowel formants from US images. The formant values are then used to synthesize continuous, time-varying vowel trajectories via the Klatt synthesizer. Our best model achieves an R-squared (R^2) value of 99.96% on the regression task. Our network lays the foundation for an SSI, as it successfully tracks the tongue contour automatically as an internal representation without any explicit annotation.
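The abstract describes a two-stage pipeline: a 3D-convolutional network regresses vowel formant frequencies from short clips of ultrasound tongue images, and the predicted formants then drive a Klatt synthesizer to produce audio. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: the class name U2FSketch, every layer size, the 8-frame clip length, and the formant/bandwidth values are assumptions, and the paper's feature-shuffling step is omitted. The second half shows how two Klatt-style second-order resonators in cascade can turn an (F1, F2) pair into a vowel-like signal.

```python
# Minimal sketch (assumed architecture, not the paper's released code):
# a small 3D-CNN maps an ultrasound tongue-image clip to (F1, F2), and a
# toy cascade of Klatt-style resonators synthesizes a vowel from formants.
import numpy as np
import torch
import torch.nn as nn
from scipy.signal import lfilter


class U2FSketch(nn.Module):
    """Hypothetical U2F-style regressor: spatio-temporal 3D convs -> formants."""

    def __init__(self, n_formants: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            # Input: (batch, 1, T, H, W) grayscale ultrasound frames.
            nn.Conv3d(1, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 4, 4)),  # collapse time, keep coarse space
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 64),
            nn.ReLU(),
            nn.Linear(64, n_formants),  # predicted formant frequencies in Hz
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.regressor(self.features(x))


def klatt_resonator(x, f_hz, bw_hz, fs):
    """One second-order digital resonator, as in Klatt (1980):
    y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    c = -np.exp(-2 * np.pi * bw_hz / fs)
    b = 2 * np.exp(-np.pi * bw_hz / fs) * np.cos(2 * np.pi * f_hz / fs)
    a = 1 - b - c
    return lfilter([a], [1, -b, -c], x)


if __name__ == "__main__":
    # One random 8-frame 64x64 "clip" stands in for real ultrasound data.
    model = U2FSketch()
    clip = torch.randn(1, 1, 8, 64, 64)
    f1, f2 = model(clip).detach().numpy().ravel()
    print(f"untrained prediction (arbitrary): F1={f1:.1f} Hz, F2={f2:.1f} Hz")

    # Synthesize 0.5 s of a vowel from fixed example formants (values assumed).
    fs = 16000
    n = np.arange(int(0.5 * fs))
    source = (n % (fs // 100) == 0).astype(float)  # 100 Hz impulse-train source
    audio = klatt_resonator(source, 700.0, 80.0, fs)  # F1 resonator
    audio = klatt_resonator(audio, 1200.0, 90.0, fs)  # F2 resonator
    print(audio.shape)  # (8000,) samples, ready to write to a WAV file
```

A real system would train the regressor on paired ultrasound/formant data and feed the predicted formant trajectory frame by frame into the synthesizer; the 99.96% R^2 quoted in the abstract refers to that regression fit, not to this toy example.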
