Paper Title
Deep Multi-Modal Sets
Paper Authors
Paper Abstract
Many vision-related tasks benefit from reasoning over multiple modalities to leverage complementary views of the data in an attempt to learn robust embedding spaces. Most deep learning-based methods rely on a late fusion technique whereby multiple feature types are encoded and concatenated, and a multi-layer perceptron (MLP) then combines the fused embedding to make predictions. This has several limitations, such as the unnatural requirement that all features be present at all times and the restriction to a fixed number of occurrences of each feature modality at any given time. Furthermore, the concatenated embedding grows as more modalities are added. To mitigate this, we propose Deep Multi-Modal Sets: a technique that represents a collection of features as an unordered set rather than one long, ever-growing fixed-size vector. The set is constructed so that we are invariant both to permutations of the feature modalities and to the cardinality of the set. We also show that with particular choices in our model architecture, we can yield interpretable feature outputs such that at inference time we can observe which modalities contribute most to the prediction. With this in mind, we demonstrate a scalable multi-modal framework that reasons over different modalities to learn various types of tasks. We achieve new state-of-the-art performance on two multi-modal datasets (Ads-Parallelity [34] and MM-IMDb [1]).
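
To make the set-pooling idea concrete, below is a minimal PyTorch-style sketch of fusing a variable collection of modality embeddings with a permutation-invariant pooling operation instead of concatenation. The module name, the per-modality linear encoders, the max/mean pooling choice, and all dimensions are illustrative assumptions, not the authors' implementation.

# Minimal sketch of set-based multi-modal fusion (illustrative only; the
# per-modality encoders, pooling choice, and dimensions are assumptions).
import torch
import torch.nn as nn

class MultiModalSetFusion(nn.Module):
    def __init__(self, modality_dims, embed_dim=256, num_classes=23, pool="max"):
        super().__init__()
        # One projection per modality maps its features into a shared space.
        self.projections = nn.ModuleDict({
            name: nn.Linear(dim, embed_dim) for name, dim in modality_dims.items()
        })
        self.pool = pool
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, num_classes),
        )

    def forward(self, features):
        # `features` holds whichever modalities are present for this batch,
        # e.g. {"image": tensor(B, 2048), "text": tensor(B, 768)}.
        projected = [self.projections[name](x) for name, x in features.items()]
        stacked = torch.stack(projected, dim=1)  # (B, num_present, embed_dim)
        # Permutation-invariant pooling over the set of modality embeddings:
        # the result does not depend on their order, and the fused embedding
        # keeps the same size no matter how many modalities are present.
        if self.pool == "max":
            fused = stacked.max(dim=1).values
        else:
            fused = stacked.mean(dim=1)
        return self.classifier(fused)

# Usage: missing modalities are simply left out of the input dict.
model = MultiModalSetFusion({"image": 2048, "text": 768, "audio": 128})
out = model({"image": torch.randn(4, 2048), "text": torch.randn(4, 768)})
print(out.shape)  # torch.Size([4, 23])

Unlike concatenation-based late fusion, this sketch does not require every modality to be present, and adding a new modality only adds one projection rather than widening the fused vector.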