Paper Title

Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions

Paper Authors

Michele Cafagna, Kees van Deemter, Albert Gatt

Paper Abstract

Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.
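
The abstract describes fine-tuning a pretrained captioning model on a small set of curated image/scene-description pairs, but does not include the authors' training code. The sketch below is a minimal, hypothetical illustration of that setup using a generic HuggingFace ViT+GPT-2 captioning checkpoint as a stand-in for VinVL (whose actual pipeline also consumes region features from an object detector); the checkpoint name, file paths, dataset format, and hyperparameters are assumptions for illustration, not details from the paper.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel
from PIL import Image

# Stand-in captioning model; the paper fine-tunes VinVL, which is not used here.
CHECKPOINT = "nlpconnect/vit-gpt2-image-captioning"

class SceneCaptionDataset(Dataset):
    """Image paths paired with scene-level descriptions (assumed format)."""
    def __init__(self, pairs, processor, tokenizer, max_len=32):
        self.pairs = pairs          # list of (image_path, scene_description)
        self.processor = processor
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        path, caption = self.pairs[idx]
        pixel_values = self.processor(
            Image.open(path).convert("RGB"), return_tensors="pt"
        ).pixel_values[0]
        labels = self.tokenizer(
            caption, max_length=self.max_len, padding="max_length",
            truncation=True, return_tensors="pt"
        ).input_ids[0]
        labels[labels == self.tokenizer.pad_token_id] = -100  # mask padding in the loss
        return {"pixel_values": pixel_values, "labels": labels}

model = VisionEncoderDecoderModel.from_pretrained(CHECKPOINT)
processor = ViTImageProcessor.from_pretrained(CHECKPOINT)
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model.config.pad_token_id = tokenizer.pad_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id

# Placeholder example: in the paper this would be the curated set of
# image/scene-description pairs (e.g. "a cluttered domestic kitchen").
pairs = [("kitchen.jpg", "a cluttered domestic kitchen")]
loader = DataLoader(SceneCaptionDataset(pairs, processor, tokenizer),
                    batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):  # few epochs: the paper finds little curated data suffices
    for batch in loader:
        loss = model(pixel_values=batch["pixel_values"],
                     labels=batch["labels"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

After fine-tuning, generation with `model.generate(pixel_values)` would produce scene-level captions; whether object-level ability is retained, as the abstract claims, would need the paper's evaluation protocol to verify.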
