Paper Title
Husformer: A Multi-Modal Transformer for Multi-Modal Human State Recognition
Paper Authors
Paper Abstract
Human state recognition is a critical topic with pervasive and important applications in human-machine systems. Multi-modal fusion, the combination of metrics from multiple data sources, has been shown to be a sound method for improving recognition performance. However, while promising results have been reported by recent multi-modal-based models, they generally fail to leverage sophisticated fusion strategies that model sufficient cross-modal interactions when producing the fusion representation; instead, current methods rely on lengthy and inconsistent data preprocessing and feature crafting. To address this limitation, we propose an end-to-end multi-modal transformer framework for multi-modal human state recognition, called Husformer. Specifically, we propose to use cross-modal transformers, which inspire one modality to reinforce itself by directly attending to latent relevance revealed in other modalities, to fuse different modalities while ensuring sufficient awareness of the cross-modal interactions introduced. Subsequently, we utilize a self-attention transformer to further prioritize contextual information in the fusion representation. Together, these two attention mechanisms enable effective and adaptive adjustment to noise and interruptions in multi-modal signals, both during the fusion process and at the level of high-level features. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive workload datasets (MOCAS and CogLoad) demonstrate that our Husformer outperforms both state-of-the-art multi-modal baselines and single-modality models by a large margin in recognizing human states, especially when dealing with raw multi-modal signals. We also conduct an ablation study to show the benefit of each component in Husformer.
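The core idea of the cross-modal transformer described above, one modality forming queries that attend over keys/values from another modality, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the identity (unlearned) projections, and the example feature shapes are all assumptions for illustration; a real model would use learned projection matrices and stacked multi-head layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(target, source):
    """One cross-modal attention step (illustrative sketch).

    The target modality supplies the queries and the source modality
    supplies keys and values, so the target's features are reinforced
    by latent relevance found in the source modality.
    target: (T_t, d), source: (T_s, d). Identity projections are used
    here for simplicity; a trained model learns W_q, W_k, W_v.
    """
    d = target.shape[-1]
    q, k, v = target, source, source            # hypothetical identity projections
    scores = q @ k.T / np.sqrt(d)               # (T_t, T_s) cross-modal affinities
    weights = softmax(scores, axis=-1)          # attend over source time steps
    return weights @ v                          # (T_t, d) reinforced representation

# Illustrative shapes only: e.g. an EEG stream attending to an ECG stream.
eeg = np.random.randn(8, 16)    # 8 time steps, 16-dim features
ecg = np.random.randn(12, 16)   # 12 time steps, 16-dim features
fused = cross_modal_attention(eeg, ecg)
print(fused.shape)  # (8, 16)
```

Note that the output keeps the target modality's sequence length, which is what lets each modality be "reinforced" in place before the subsequent self-attention transformer prioritizes contextual information across the concatenated fusion representation.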