Improved Processing of Ultrasound Tongue Videos by Combining ConvLSTM and 3D Convolutional Networks

Paper Title

Improved Processing of Ultrasound Tongue Videos by Combining ConvLSTM and 3D Convolutional Networks

Authors

Shandiz, Amin Honarmandi; Toth, Laszlo

Abstract

Silent Speech Interfaces aim to reconstruct the acoustic signal from a sequence of ultrasound tongue images that records the articulatory movement. Extracting information about the tongue movement requires us to process the whole sequence of images efficiently, not just a single image. Several approaches have been suggested for processing such sequential image data. The classic neural network structure combines two-dimensional convolutional (2D-CNN) layers, which process the images separately, with recurrent layers (e.g., an LSTM) on top of them to fuse the information along time. More recently, it was shown that one may also apply a 3D-CNN network that extracts information along both the spatial and the temporal axes in parallel, achieving similar accuracy while being less time consuming. A third option is to apply the less well-known ConvLSTM layer type, which combines the advantages of LSTM and CNN layers by replacing matrix multiplication with the convolution operation. In this paper, we experimentally compared various combinations of the above-mentioned layer types for a silent speech interface task, and we obtained the best result with a hybrid model that consists of a combination of 3D-CNN and ConvLSTM layers. This hybrid network is slightly faster, smaller and more accurate than our previous 3D-CNN model.
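The abstract's remark that ConvLSTM replaces matrix multiplication with convolution can be made concrete by counting weights. The sketch below is only an illustration, not the authors' model: it uses the standard parameter-count formulas for an LSTM layer and a ConvLSTM layer, and the image size, unit count, filter count and kernel size are hypothetical values chosen for the example, not taken from the paper.

```python
def lstm_params(input_dim, units):
    """Weights of a fully connected LSTM layer: 4 gates, each with
    input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def convlstm_params(in_channels, filters, kernel_size):
    """Weights of a ConvLSTM layer: the same 4 gates, but each matrix
    multiplication becomes a kernel_size x kernel_size convolution."""
    k = kernel_size
    return 4 * (k * k * (in_channels + filters) * filters + filters)


# Hypothetical sizes, chosen only for illustration (not from the paper).
height, width, channels = 64, 128, 1

# An LSTM has to see each frame as one flattened vector.
dense = lstm_params(input_dim=height * width * channels, units=256)

# A ConvLSTM slides small kernels over the frame and keeps its 2D layout.
conv = convlstm_params(in_channels=channels, filters=32, kernel_size=3)

print(f"LSTM on flattened frames: {dense:,} weights")
print(f"ConvLSTM with 3x3 kernels: {conv:,} weights")
```

Under these assumed sizes the ConvLSTM layer needs far fewer weights, because its kernels are shared across all image positions. This says nothing about accuracy or runtime, which the paper evaluates experimentally.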
