Paper Title

Temporal superimposed crossover module for effective continuous sign language

Authors

Qidan Zhu, Jing Li, Fei Yuan, Quan Gan

Abstract


The ultimate goal of continuous sign language recognition (CSLR) is to facilitate communication between hearing-impaired and hearing people, which requires a certain degree of real-time performance and deployability from the model. However, previous research on CSLR has paid little attention to real-time performance and deployability. To improve both, this paper proposes a zero-parameter, zero-computation temporal superposition crossover module (TSCM) and combines it with 2D convolution to form a "TSCM+2D convolution" hybrid convolution, which gives 2D convolution strong spatial-temporal modelling capability with zero added parameters and a lower deployment cost than other spatial-temporal convolutions. The overall TSCM-based CSLR model is built on the improved ResBlockT network proposed in this paper: the "TSCM+2D convolution" hybrid convolution is applied to the ResBlock of the ResNet network to form the new ResBlockT, and random gradient stopping and a multi-level CTC loss are introduced to train the model, which reduces the final recognition WER while lowering training memory usage, and extends the ResNet network from the image classification task to the video recognition task. In addition, this study is the first in CSLR to use only 2D convolution to extract temporal-spatial features from sign language video for end-to-end recognition. Experiments on two large-scale continuous sign language datasets demonstrate the effectiveness of the proposed method, which achieves highly competitive results.
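The abstract does not spell out the TSCM's exact operation, but the key claim is that temporal information can be mixed across frames at zero parameter and zero FLOP cost before an ordinary 2D convolution. A minimal sketch of that general idea, in the spirit of temporal-shift-style modules (the function name, `fold_div` parameter, and channel split are illustrative assumptions, not the paper's definition):

```python
import numpy as np

def temporal_mix(x, fold_div=8):
    """Zero-parameter temporal mixing over a video clip (illustrative sketch).

    x: array of shape (T, C, H, W). A fraction of the channels is shifted
    one frame back in time, another fraction one frame forward, and the
    rest are left untouched, so a subsequent per-frame 2D convolution
    sees information from adjacent frames without any learned weights.
    """
    T, C, H, W = x.shape
    fold = C // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]               # pull features from the next frame
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]  # pull features from the previous frame
    out[:, 2 * fold:] = x[:, 2 * fold:]          # remaining channels unchanged
    return out
```

Because the mixing is pure data movement, applying it before each 2D convolution inside a residual block adds temporal receptive field while keeping the parameter count and deployment footprint of a plain 2D ResNet, which is the trade-off the abstract emphasizes.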
