Paper Title

Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!

Paper Authors

Jack Hessel, Lillian Lee

Paper Abstract

Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering. However, sometimes high-performing black-box algorithms turn out to be mostly exploiting unimodal signals in the data. We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task. This function projection modifies model predictions so that cross-modal interactions are eliminated, isolating the additive, unimodal structure. For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation. Surprisingly, this holds even when expressive models, with capacity to consider interactions, otherwise outperform less expressive models; thus, performance improvements, even when present, often cannot be attributed to consideration of cross-modal feature interactions. We hence recommend that researchers in multimodal machine learning report the performance not only of unimodal baselines, but also the EMAP of their best-performing model.
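The abstract does not spell out the projection itself, only that it replaces a model's predictions with their best additive, unimodal approximation. A common way to estimate such an additive projection empirically is to average the model's scores over the other modality's inputs. The sketch below illustrates that idea; the `pred_fn` interface, the feature lists, and the full N×N sweep over text-image pairings are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def emap_projection(pred_fn, text_feats, image_feats):
    """Empirically project a model's predictions onto additive (unimodal) structure.

    pred_fn(t, v)  -> score or vector of class logits for one (text, image) pair.
    text_feats, image_feats -> aligned features for the N evaluation examples.
    Returns projected scores for the aligned pairs (i, i), with cross-modal
    interactions averaged out.
    """
    n = len(text_feats)

    # Score every text against every image: full[i, j] = pred_fn(text_i, image_j).
    # Note this costs O(N^2) forward passes.
    full = np.stack([
        np.stack([np.asarray(pred_fn(text_feats[i], image_feats[j]), dtype=float)
                  for j in range(n)])
        for i in range(n)
    ])

    text_marginal = full.mean(axis=1)    # per text, averaged over all images
    image_marginal = full.mean(axis=0)   # per image, averaged over all texts
    grand_mean = full.mean(axis=(0, 1))  # overall average

    # Additive approximation evaluated at each aligned pair (text_i, image_i).
    return text_marginal + image_marginal - grand_mean
```

Comparing the accuracy obtained from `argmax` over these projected scores with the accuracy of the original predictions then indicates how much of the model's performance survives once cross-modal interactions are removed.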
