Paper Title

Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition

Authors

Hyeongju Choi, Apoorva Beedu, Harish Haresamudram, Irfan Essa

Abstract

To properly assist humans in their needs, human activity recognition (HAR) systems need the ability to fuse information from multiple modalities. Our hypothesis is that multimodal sensors, visual and non-visual, tend to provide complementary information, each addressing the limitations of the other modalities. In this work, we propose a multi-modal framework that learns to effectively combine features from RGB video and IMU sensors, and show its robustness on the MMAct and UTD-MHAD datasets. Our model is trained in two stages: in the first stage, each input encoder learns to effectively extract features, and in the second stage, the model learns to combine these individual features. We show significant improvements of 22% and 11% over the video-only and IMU-only setups on the UTD-MHAD dataset, and of 20% and 12% on the MMAct dataset. Through extensive experimentation, we show the robustness of our model in the zero-shot setting and in the limited-annotated-data setting. We further compare against state-of-the-art methods that use more input modalities and show that our method significantly outperforms them on the more difficult MMAct dataset, and performs comparably on the UTD-MHAD dataset.
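To make the two-stage recipe in the abstract concrete, below is a minimal PyTorch-style sketch. The encoder architectures, feature dimensions, class count, and the simple concatenation-based fusion head are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of two-stage multi-modal training (illustrative, not the paper's exact model).
import torch
import torch.nn as nn

NUM_CLASSES = 27  # e.g. UTD-MHAD defines 27 activity classes

class IMUEncoder(nn.Module):
    """Toy 1D-CNN over (batch, channels, time) IMU windows."""
    def __init__(self, in_channels=6, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())

    def forward(self, x):
        return self.net(x)                       # (batch, feat_dim)

class VideoEncoder(nn.Module):
    """Placeholder video branch; a real system would use a pretrained video backbone."""
    def __init__(self, in_dim=512, feat_dim=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())

    def forward(self, clip_feats):
        return self.proj(clip_feats)             # (batch, feat_dim)

video_enc, imu_enc = VideoEncoder(), IMUEncoder()
video_head = nn.Linear(128, NUM_CLASSES)         # stage-1 per-modality classifiers
imu_head = nn.Linear(128, NUM_CLASSES)
fusion_head = nn.Sequential(                     # stage-2 fusion classifier
    nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, NUM_CLASSES))
criterion = nn.CrossEntropyLoss()

def stage1_step(video_x, imu_x, labels, opt):
    """Stage 1: each encoder learns to extract features for its own modality."""
    loss = (criterion(video_head(video_enc(video_x)), labels) +
            criterion(imu_head(imu_enc(imu_x)), labels))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def stage2_step(video_x, imu_x, labels, opt):
    """Stage 2: learn to combine the per-modality features (encoders kept fixed here)."""
    with torch.no_grad():
        fused = torch.cat([video_enc(video_x), imu_enc(imu_x)], dim=1)
    loss = criterion(fusion_head(fused), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

In this sketch, `opt` in stage 1 would optimize the two encoders and their heads, while in stage 2 it would optimize only `fusion_head`; whether the encoders stay frozen or are fine-tuned in stage 2 is a design choice not specified by the abstract.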
