Paper Title
Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content Dilutions
Paper Authors
Paper Abstract
As multimodal learning finds applications in a wide variety of high-stakes societal tasks, investigating the robustness of these models becomes important. Existing work has focused on understanding the robustness of vision-and-language models to imperceptible variations on benchmark tasks. In this work, we investigate the robustness of multimodal classifiers to cross-modal dilutions, a plausible class of variations. We develop a model that, given a multimodal (image + text) input, generates additional dilution text that (a) maintains relevance and topical coherence with the image and existing text, and (b) when added to the original text, leads to misclassification of the multimodal input. Via experiments on Crisis Humanitarianism and Sentiment Detection tasks, we find that the performance of task-specific fusion-based multimodal classifiers drops by 23.3% and 22.5%, respectively, in the presence of dilutions generated by our model. Metric-based comparisons with several baselines and human evaluations indicate that our dilutions show higher relevance and topical coherence, while simultaneously being more effective at demonstrating the brittleness of the multimodal classifiers. Our work aims to highlight and encourage further research on the robustness of deep multimodal models to realistic variations, especially in human-facing societal applications. The code and other resources are available at https://claws-lab.github.io/multimodal-robustness/.
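To make the evaluation protocol concrete, below is a minimal sketch of how the reported accuracy drop under cross-modal dilutions could be measured. The `classify` and `generate_dilution` functions are hypothetical placeholders standing in for a fusion-based multimodal classifier and the paper's dilution generator, respectively; this is not the authors' implementation.

```python
# Minimal sketch of the dilution robustness evaluation protocol.
# `classify` and `generate_dilution` are hypothetical stand-ins: any
# fusion-based multimodal classifier and any text generator conditioned
# on the (image, text) pair could be plugged in.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    image: bytes  # raw image content (placeholder representation)
    text: str     # original accompanying text
    label: int    # ground-truth class


def accuracy_drop(
    examples: List[Example],
    classify: Callable[[bytes, str], int],
    generate_dilution: Callable[[bytes, str], str],
) -> float:
    """Return the absolute accuracy drop when dilution text is appended."""
    clean_correct = diluted_correct = 0
    for ex in examples:
        # Clean prediction on the original (image + text) input.
        if classify(ex.image, ex.text) == ex.label:
            clean_correct += 1
        # The dilution text is appended to the original text;
        # the image modality is left untouched.
        diluted_text = ex.text + " " + generate_dilution(ex.image, ex.text)
        if classify(ex.image, diluted_text) == ex.label:
            diluted_correct += 1
    return (clean_correct - diluted_correct) / len(examples)
```

Under this protocol, the quantity returned by `accuracy_drop` corresponds to the performance drops the abstract reports (23.3% on Crisis Humanitarianism and 22.5% on Sentiment Detection).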