Paper Title
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
Paper Authors
Paper Abstract
Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level. Although they achieve impressive performance on standard tasks, it remains unclear to date how robust these pre-trained models are. To investigate, we conduct a host of thorough evaluations of existing pre-trained models over 4 different types of V+L-specific model robustness: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Interestingly, with standard model finetuning alone, pre-trained V+L models already exhibit better robustness than many task-specific state-of-the-art methods. To further enhance model robustness, we propose MANGO, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models. Unlike previous studies that focus on one specific type of robustness, MANGO is task-agnostic and yields a universal performance lift for pre-trained models across diverse tasks designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that MANGO achieves new state of the art on 7 out of 9 robustness benchmarks, surpassing existing methods by a significant margin. As the first comprehensive study on V+L robustness, this work puts the robustness of pre-trained models into sharper focus, pointing to new directions for future study.
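The abstract describes MANGO only at a high level: a noise generator learned in the embedding space, trained adversarially against the model during finetuning. As a rough illustration of that min-max idea, here is a minimal PyTorch sketch; everything in it (the NoiseGenerator parameterization, the tanh/eps noise bound, the function names, and the clean-plus-adversarial loss combination) is an assumption made for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class NoiseGenerator(nn.Module):
    """Maps an embedding to a small additive perturbation.
    The tanh/eps bound is an illustrative choice, not the paper's exact design."""
    def __init__(self, dim: int, eps: float = 1e-2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.eps = eps

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.eps * torch.tanh(self.net(emb))

def adversarial_finetune_step(model, gen_img, gen_txt, img_emb, txt_emb,
                              labels, loss_fn, opt_model, opt_gen):
    """One min-max step: the generators learn to raise the task loss,
    the model learns to stay correct on clean and perturbed embeddings.
    `model` is a hypothetical V+L model taking image/text embeddings."""
    # Generator step: gradient *ascent* on the task loss (fool the model).
    adv_out = model(img_emb + gen_img(img_emb), txt_emb + gen_txt(txt_emb))
    opt_gen.zero_grad()
    (-loss_fn(adv_out, labels)).backward()
    opt_gen.step()

    # Model step: minimize loss on clean + adversarial embeddings
    # (noise is generated without grad here, so only the model updates).
    with torch.no_grad():
        noisy_img = img_emb + gen_img(img_emb)
        noisy_txt = txt_emb + gen_txt(txt_emb)
    loss = loss_fn(model(img_emb, txt_emb), labels) \
         + loss_fn(model(noisy_img, noisy_txt), labels)
    opt_model.zero_grad()
    loss.backward()
    opt_model.step()
    return loss.item()
```

The alternating update above mirrors generic min-max adversarial training; the paper's exact objective, noise constraints, and update schedule may differ from this sketch.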