Paper Title

Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition

Authors

Hyeongju Choi, Apoorva Beedu, Harish Haresamudram, Irfan Essa

Abstract

To properly assist humans in their needs, human activity recognition (HAR) systems need the ability to fuse information from multiple modalities. Our hypothesis is that multimodal sensors, visual and non-visual, tend to provide complementary information, each addressing the limitations of the other modalities. In this work, we propose a multi-modal framework that learns to effectively combine features from RGB video and IMU sensors, and show its robustness on the MMAct and UTD-MHAD datasets. Our model is trained in two stages: in the first stage, each input encoder learns to effectively extract features, and in the second stage, the model learns to combine these individual features. We show significant improvements of 22% and 11% over the video-only and IMU-only setups on the UTD-MHAD dataset, and of 20% and 12% on the MMAct dataset. Through extensive experimentation, we show the robustness of our model in the zero-shot setting and in the limited-annotated-data setting. We further compare against state-of-the-art methods that use more input modalities and show that our method significantly outperforms them on the more difficult MMAct dataset, and performs comparably on the UTD-MHAD dataset.
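To make the two-stage recipe in the abstract concrete, below is a minimal PyTorch-style sketch. The encoder architectures, feature dimensions, class count, and the simple concatenation-based fusion head are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of two-stage multi-modal training (illustrative, not the paper's exact model).
import torch
import torch.nn as nn

NUM_CLASSES = 27  # e.g. UTD-MHAD defines 27 activity classes

class IMUEncoder(nn.Module):
    """Toy 1D-CNN over (batch, channels, time) IMU windows."""
    def __init__(self, in_channels=6, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())

    def forward(self, x):
        return self.net(x)                       # (batch, feat_dim)

class VideoEncoder(nn.Module):
    """Placeholder video branch; a real system would use a pretrained video backbone."""
    def __init__(self, in_dim=512, feat_dim=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())

    def forward(self, clip_feats):
        return self.proj(clip_feats)             # (batch, feat_dim)

video_enc, imu_enc = VideoEncoder(), IMUEncoder()
video_head = nn.Linear(128, NUM_CLASSES)         # stage-1 per-modality classifiers
imu_head = nn.Linear(128, NUM_CLASSES)
fusion_head = nn.Sequential(                     # stage-2 fusion classifier
    nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, NUM_CLASSES))
criterion = nn.CrossEntropyLoss()

def stage1_step(video_x, imu_x, labels, opt):
    """Stage 1: each encoder learns to extract features for its own modality."""
    loss = (criterion(video_head(video_enc(video_x)), labels) +
            criterion(imu_head(imu_enc(imu_x)), labels))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def stage2_step(video_x, imu_x, labels, opt):
    """Stage 2: learn to combine the per-modality features (encoders kept fixed here)."""
    with torch.no_grad():
        fused = torch.cat([video_enc(video_x), imu_enc(imu_x)], dim=1)
    loss = criterion(fusion_head(fused), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

In this sketch, `opt` in stage 1 would optimize the two encoders and their heads, while in stage 2 it would optimize only `fusion_head`; whether the encoders stay frozen or are fine-tuned in stage 2 is a design choice not specified by the abstract.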
