Paper Title
Two-Stream Network for Sign Language Recognition and Translation
Paper Authors
Paper Abstract
Sign languages are visual languages using manual articulations and non-manual elements to convey information. For sign language recognition and translation, the majority of existing approaches directly encode RGB videos into hidden representations. RGB videos, however, are raw signals with substantial visual redundancy, leading the encoder to overlook the key information for sign language understanding. To mitigate this problem and better incorporate domain knowledge, such as handshape and body movement, we introduce a dual visual encoder containing two separate streams to model both the raw videos and the keypoint sequences generated by an off-the-shelf keypoint estimator. To make the two streams interact with each other, we explore a variety of techniques, including bidirectional lateral connection, sign pyramid network with auxiliary supervision, and frame-level self-distillation. The resulting model is called TwoStream-SLR, which is competent for sign language recognition (SLR). TwoStream-SLR is extended to a sign language translation (SLT) model, TwoStream-SLT, by simply attaching an extra translation network. Experimentally, our TwoStream-SLR and TwoStream-SLT achieve state-of-the-art performance on SLR and SLT tasks across a series of datasets including Phoenix-2014, Phoenix-2014T, and CSL-Daily. Code and models are available at: https://github.com/FangyunWei/SLRT.
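To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of the two-stream idea: separate encoders over RGB-frame features and keypoint features, coupled through a bidirectional lateral connection, with a frame-level gloss head for recognition. All module choices, names, and dimensions here (`TwoStreamEncoder`, `LateralConnection`, the GRU backbones, the feature sizes) are illustrative assumptions for this sketch, not the authors' implementation; the official TwoStream-SLR/SLT code is at https://github.com/FangyunWei/SLRT.

```python
# Illustrative sketch only: the paper uses 3D-CNN (S3D-style) streams plus a
# sign pyramid network and frame-level self-distillation; plain GRUs stand in
# here to keep the example short and self-contained.
import torch
import torch.nn as nn


class LateralConnection(nn.Module):
    """Bidirectional lateral connection: each stream receives a projected
    copy of the other stream's features and fuses it additively."""

    def __init__(self, dim: int):
        super().__init__()
        self.video_to_kpt = nn.Linear(dim, dim)
        self.kpt_to_video = nn.Linear(dim, dim)

    def forward(self, video_feat, kpt_feat):
        fused_video = video_feat + self.kpt_to_video(kpt_feat)
        fused_kpt = kpt_feat + self.video_to_kpt(video_feat)
        return fused_video, fused_kpt


class TwoStreamEncoder(nn.Module):
    """Dual visual encoder: one temporal stream over per-frame RGB features,
    one over keypoint sequences, interacting via a lateral connection."""

    def __init__(self, video_dim=512, kpt_dim=274, hidden=512, vocab=1000):
        super().__init__()
        # kpt_dim and vocab are toy values (flattened keypoint coordinates
        # and a placeholder gloss vocabulary size).
        self.video_stream = nn.GRU(video_dim, hidden, batch_first=True)
        self.kpt_stream = nn.GRU(kpt_dim, hidden, batch_first=True)
        self.lateral = LateralConnection(hidden)
        # Frame-level gloss classifier; for SLR this would be trained with
        # a CTC-style objective over gloss sequences.
        self.gloss_head = nn.Linear(2 * hidden, vocab)

    def forward(self, video_feats, kpt_seq):
        v, _ = self.video_stream(video_feats)   # (B, T, hidden)
        k, _ = self.kpt_stream(kpt_seq)         # (B, T, hidden)
        v, k = self.lateral(v, k)
        joint = torch.cat([v, k], dim=-1)       # joint two-stream features
        return self.gloss_head(joint)           # (B, T, vocab) gloss logits


# Toy usage: a batch of 2 clips, 16 frames each.
model = TwoStreamEncoder()
video_feats = torch.randn(2, 16, 512)  # per-frame RGB features
kpt_seq = torch.randn(2, 16, 274)      # flattened keypoint coordinates
logits = model(video_feats, kpt_seq)
print(logits.shape)  # torch.Size([2, 16, 1000])
```

For translation (TwoStream-SLT), the abstract states that the recognition model is extended by simply attaching an extra translation network; in terms of this sketch, that would amount to feeding the joint two-stream representation into a sequence-to-sequence decoder that generates spoken-language text.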