Paper Title
DM$^2$S$^2$: Deep Multi-Modal Sequence Sets with Hierarchical Modality Attention
Paper Authors
Paper Abstract
There is increasing interest in the use of multimodal data in various web applications, such as digital advertising and e-commerce. Typical methods for extracting important information from multimodal data rely on a mid-fusion architecture that combines the feature representations from multiple encoders. However, as the number of modalities increases, several potential problems arise with the mid-fusion model structure, such as an increase in the dimensionality of the concatenated multimodal features, and the handling of missing modalities. To address these problems, we propose a new concept that treats multimodal inputs as a set of sequences, namely, deep multimodal sequence sets (DM$^2$S$^2$). Our set-aware concept consists of three components that capture the relationships among multiple modalities: (a) a BERT-based encoder to handle the inter- and intra-order of elements in the sequences, (b) intra-modality residual attention (IntraMRA) to capture the importance of the elements within a modality, and (c) inter-modality residual attention (InterMRA) to further enhance the importance of elements with modality-level granularity. Our concept exhibits performance comparable to or better than that of previous set-aware models. Furthermore, we demonstrate that visualization of the learned InterMRA and IntraMRA weights can provide an interpretation of the prediction results.
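The abstract names the three components but not their equations. Below is a minimal PyTorch sketch of the hierarchical attention idea only; the residual formulation (h + a ⊙ h), the mean-pooled modality summaries, and all class names and dimensions are illustrative assumptions, not the paper's exact DM$^2$S$^2$ architecture.

```python
# Hypothetical sketch of hierarchical modality attention (IntraMRA then
# InterMRA). The residual re-weighting h + a * h and the mean pooling are
# assumptions for illustration; the paper's actual formulation may differ.
import torch
import torch.nn as nn


class ResidualAttention(nn.Module):
    """Scores each element and re-weights it through a residual connection."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):
        # h: (batch, num_elements, dim)
        a = torch.softmax(self.score(h), dim=1)  # per-element importance
        return h + a * h                         # residual re-weighting


class HierarchicalModalityAttention(nn.Module):
    """IntraMRA within each modality, then InterMRA across modality summaries."""

    def __init__(self, dim, num_modalities):
        super().__init__()
        self.intra = nn.ModuleList(
            [ResidualAttention(dim) for _ in range(num_modalities)]
        )
        self.inter = ResidualAttention(dim)

    def forward(self, modalities):
        # Each entry: (batch, seq_len_m, dim), e.g. BERT-encoded tokens of
        # one modality's sequence.
        summaries = [
            intra(h).mean(dim=1) for intra, h in zip(self.intra, modalities)
        ]
        stacked = torch.stack(summaries, dim=1)  # (batch, num_modalities, dim)
        return self.inter(stacked).mean(dim=1)   # fused representation


# Example: three modalities (e.g., text, OCR tokens, image regions)
model = HierarchicalModalityAttention(dim=768, num_modalities=3)
inputs = [torch.randn(2, n, 768) for n in (16, 8, 4)]
print(model(inputs).shape)  # torch.Size([2, 768])
```

Because both stages produce explicit attention weights, they can be inspected per element and per modality, which is what enables the interpretability via weight visualization that the abstract describes.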