Paper Title
Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos
Paper Authors
Paper Abstract
Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address occlusion and ambiguity, we develop a transformer-based framework to exploit temporal information for robust estimation. Noticing the different temporal granularity of and the semantic correlation between hand pose estimation and action recognition, we build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation, and the latter aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O. Extensive ablation studies verify our design choices.
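The two-level hierarchy described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the window length, feature sizes, single-head attention, and mean pooling are all assumptions made for brevity, and the real model uses learned transformer encoders with object information fused at the second level.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head scaled dot-product self-attention over the time axis.

    x: (T, D) sequence of per-frame features.
    Stands in for a full transformer encoder block in this sketch.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)   # (T, T) pairwise attention logits
    return softmax(scores) @ x      # (T, D) temporally mixed features

def hierarchical_temporal_sketch(frames, window=8):
    """Two cascaded temporal stages, mirroring the paper's hierarchy.

    frames: (T, D) per-frame image features (toy stand-ins here).
    Stage 1 attends within short windows (short-term cue -> per-frame
    pose features); stage 2 attends over the whole span and pools to a
    single action representation.
    """
    T, D = frames.shape
    pose_feats = np.empty_like(frames)
    for start in range(0, T, window):
        chunk = frames[start:start + window]
        pose_feats[start:start + window] = self_attention(chunk)
    action_feats = self_attention(pose_feats)  # long-span aggregation
    action_repr = action_feats.mean(axis=0)    # (D,) pooled action feature
    return pose_feats, action_repr

rng = np.random.default_rng(0)
pose_feats, action_repr = hierarchical_temporal_sketch(rng.normal(size=(32, 16)))
```

The key design point the sketch mirrors is the granularity split: the first stage only mixes information within a short window (pose varies quickly), while the second stage attends across the full clip (actions unfold over longer spans).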