Paper Title


Video-based Cross-modal Auxiliary Network for Multimodal Sentiment Analysis

Authors

Rongfei Chen, Wenju Zhou, Yang Li, Huiyu Zhou

Abstract


Multimodal sentiment analysis has a wide range of applications due to the information complementarity of multimodal interactions. Previous works have focused on learning efficient joint representations, but they rarely consider insufficient unimodal feature extraction and the data redundancy of multimodal fusion. In this paper, a Video-based Cross-modal Auxiliary Network (VCAN) is proposed, comprising an audio feature map module and a cross-modal selection module. The first module substantially increases feature diversity during audio feature extraction, aiming to improve classification accuracy by providing more comprehensive acoustic representations. To enable the model to handle redundant visual features, the second module efficiently filters out redundant visual frames when integrating audiovisual data. Moreover, a classifier group consisting of several image classification networks is introduced to predict sentiment polarity and emotion categories. Extensive experimental results on the RAVDESS, CMU-MOSI, and CMU-MOSEI benchmarks indicate that VCAN significantly outperforms state-of-the-art methods in improving the classification accuracy of multimodal sentiment analysis.
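To make the two modules described in the abstract more concrete, below is a minimal PyTorch sketch of how an audio feature map module (projecting acoustic features into image-like maps for an image classifier) and a cross-modal selection module (keeping only the visual frames most relevant to the audio cue) could be structured. All class names, dimensions, and the top-k scoring scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioFeatureMap(nn.Module):
    """Projects an acoustic feature vector into several 2-D feature maps,
    so that image classification networks can consume the audio modality
    (illustrative stand-in for the paper's audio feature map module)."""
    def __init__(self, in_dim=40, out_channels=3, map_size=64):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_channels * map_size * map_size)
        self.out_channels, self.map_size = out_channels, map_size

    def forward(self, audio):  # audio: (B, in_dim)
        maps = self.proj(audio)
        return maps.view(-1, self.out_channels, self.map_size, self.map_size)

class CrossModalSelection(nn.Module):
    """Scores each visual frame against the audio embedding and keeps the
    top-k frames, filtering redundant frames (assumed scoring scheme)."""
    def __init__(self, feat_dim=512, k=4):
        super().__init__()
        self.score = nn.Linear(feat_dim * 2, 1)
        self.k = k

    def forward(self, frames, audio_feat):
        # frames: (B, T, D) per-frame visual features; audio_feat: (B, D)
        T = frames.size(1)
        audio_exp = audio_feat.unsqueeze(1).expand(-1, T, -1)      # (B, T, D)
        scores = self.score(torch.cat([frames, audio_exp], -1)).squeeze(-1)
        topk = scores.topk(self.k, dim=1).indices                  # (B, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, frames.size(-1))
        return frames.gather(1, idx)                               # (B, k, D)

if __name__ == "__main__":
    frames = torch.randn(2, 16, 512)       # 2 clips, 16 frames, 512-d features
    audio = torch.randn(2, 40)             # 2 clips, 40-d acoustic features
    audio_maps = AudioFeatureMap()(audio)  # image-like maps for the classifier group
    audio_feat = torch.randn(2, 512)       # assumed encoded audio embedding
    kept = CrossModalSelection()(frames, audio_feat)
    print(audio_maps.shape, kept.shape)    # (2, 3, 64, 64) and (2, 4, 512)
```

In this reading, the selected frames and the audio feature maps would then be passed to the classifier group of image classification networks to predict sentiment polarity and emotion category.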
