Paper Title

Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos

Paper Authors

Saghir Alfasly, Jian Lu, Chen Xu, Yuru Zou

Paper Abstract

Under the assumption that a video dataset is multimodally annotated, i.e., that both the auditory and visual modalities are labeled or class-relevant, current multimodal methods apply modality fusion or cross-modality attention. However, effectively leveraging the audio modality for action recognition in vision-specific annotated videos is particularly challenging. To tackle this challenge, we propose a novel audio-visual framework that effectively leverages the audio modality in any solely vision-specific annotated dataset. We adopt a language model (e.g., BERT) to build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels; SAVLD thus serves as a bridge between audio and video datasets. Then, SAVLD, together with a pretrained audio multi-label model, is used to estimate the audio-visual modality relevance during the training phase. Accordingly, we propose a novel learnable irrelevant modality dropout (IMD) that completely drops out the irrelevant audio modality and fuses only the relevant modalities. Moreover, we present a new two-stream video Transformer for efficiently modeling the visual modalities. Results on several vision-specific annotated datasets, including Kinetics400 and UCF-101, validate our framework, as it outperforms the most relevant action recognition methods.
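To make the SAVLD idea concrete, below is a minimal sketch, not the authors' code: it assumes mean-pooled BERT embeddings and cosine similarity as the label-relevance measure (the paper may use different pooling or similarity), and all label names in the example are hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed(labels):
    # Mean-pooled, L2-normalized BERT embeddings; one row per label string.
    enc = tokenizer(labels, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state               # (N, T, 768)
    mask = enc["attention_mask"].unsqueeze(-1).float()   # (N, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)        # masked mean over tokens
    return torch.nn.functional.normalize(pooled, dim=-1)

def build_savld(video_labels, audio_labels, k=3):
    # Map each video label to its K most cosine-similar audio labels.
    sim = embed(video_labels) @ embed(audio_labels).T    # (V, A) cosine scores
    scores, idx = sim.topk(k, dim=-1)
    return {
        v: [(audio_labels[j], scores[i, n].item())
            for n, j in enumerate(idx[i].tolist())]
        for i, v in enumerate(video_labels)
    }

# Hypothetical label sets, purely for illustration.
savld = build_savld(
    ["playing guitar", "mowing the lawn", "reading a book"],
    ["guitar strum", "lawn mower engine", "speech", "silence"],
    k=2,
)
```

Likewise, a hedged sketch of what a learnable irrelevant modality dropout gate could look like: a learnable relevance threshold with a straight-through estimator so that the hard drop decision remains trainable. This is an assumption about the mechanism, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class IrrelevantModalityDropout(nn.Module):
    """Hard-drops audio features whose relevance falls below a learnable threshold."""

    def __init__(self, init_threshold=0.5, temperature=10.0):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(init_threshold))
        self.temperature = temperature

    def forward(self, audio_feat, relevance):
        # audio_feat: (B, D) audio features; relevance: (B,) per-clip score,
        # e.g., estimated via SAVLD and a pretrained audio multi-label model.
        soft = torch.sigmoid(self.temperature * (relevance - self.threshold))
        hard = (relevance > self.threshold).float()
        # Straight-through estimator: hard 0/1 gate in the forward pass,
        # soft sigmoid gradient in the backward pass for learning the threshold.
        gate = hard + soft - soft.detach()
        return audio_feat * gate.unsqueeze(-1)
```

The hard gate realizes the "completely drop out" behavior described in the abstract: irrelevant audio contributes exactly zero to the fusion, rather than being merely down-weighted.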
