Multi-Layer Content Interaction Through Quaternion Product for Visual Question Answering

Paper Title

Multi-Layer Content Interaction Through Quaternion Product For Visual Question Answering

Authors

Lei Shi, Shijie Geng, Kai Shuang, Chiori Hori, Songxiang Liu, Peng Gao, Sen Su

Abstract

Multi-modality fusion technologies have greatly improved the performance of neural network-based Video Description/Caption, Visual Question Answering (VQA), and Audio Visual Scene-aware Dialog (AVSD) in recent years. Most previous approaches explore only the last layer of multi-layer feature fusion and overlook the importance of the intermediate layers. To address this, we propose an efficient Quaternion Block Network (QBN) that learns interactions not only for the last layer but also for all intermediate layers simultaneously. In the proposed QBN, holistic text features guide the update of visual features. Meanwhile, Hamilton quaternion products efficiently carry information from higher layers to lower layers for both the visual and text modalities. The evaluation results show that QBN improves performance on VQA 2.0, even surpassing approaches that use large-scale BERT or visual BERT pre-trained models. Extensive ablation studies have been carried out to verify the influence of each proposed module.
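
For readers unfamiliar with the operation named in the abstract, the following is a minimal sketch of the standard Hamilton product of two quaternions, written in plain Python. It illustrates only the algebraic rule. How QBN assigns layer features to the four quaternion components is not stated in the abstract, so the function name `hamilton_product` and the tuple representation are assumptions made purely for illustration, not the authors' implementation.

```python
from typing import Tuple

# A quaternion a + b*i + c*j + d*k stored as (a, b, c, d).
# This representation is an illustrative assumption.
Quaternion = Tuple[float, float, float, float]


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Return the Hamilton product p * q (non-commutative)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


# Basis quaternions, used to check the defining identities.
I = (0.0, 1.0, 0.0, 0.0)
J = (0.0, 0.0, 1.0, 0.0)
K = (0.0, 0.0, 0.0, 1.0)
```

Each output component mixes all four components of both inputs, which is the property that lets a single product combine four feature groups at once. The product is also non-commutative: `hamilton_product(I, J)` gives `K`, while `hamilton_product(J, I)` gives the negation of `K`.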
