Paper Title
Visually-Augmented Language Modeling
Paper Authors
Paper Abstract
Human language is grounded on multimodal knowledge including visual knowledge like colors, sizes, and shapes. However, current large-scale pre-trained language models rely on text-only self-supervised training with massive text data, which precludes them from utilizing relevant visual information when necessary. To address this, we propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling. Specifically, VaLM builds on a novel latent text-image alignment method via an image retrieval module to fetch corresponding images given a textual context. With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling by attending to both text context and visual knowledge in images. We evaluate VaLM on various visual knowledge-intensive commonsense reasoning tasks, which require visual information to excel. The experimental results illustrate that VaLM outperforms all strong language-only and vision-language baselines with substantial gains in reasoning object commonsense including color, size, and shape. Our code is available at https://github.com/Victorwz/VaLM.
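To make the two components described in the abstract concrete, the following is a minimal PyTorch sketch of (1) an image retrieval step that fetches image embeddings for a textual context via nearest-neighbor search in a shared embedding space, and (2) a visual knowledge fusion layer that attends jointly over text states and the retrieved image features. It assumes CLIP-style encoders and a toy in-memory image-embedding index; the function and class names (`retrieve_images`, `VisualKnowledgeFusionLayer`) are illustrative placeholders, not the authors' actual implementation (see the linked repository for that).

```python
# Illustrative sketch only; assumes pre-computed CLIP-style embeddings,
# not VaLM's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def retrieve_images(text_query: torch.Tensor,
                    image_index: torch.Tensor,
                    k: int = 4) -> torch.Tensor:
    """Fetch the k image embeddings most similar to each text-context embedding.

    text_query:  (batch, d) text-context embeddings (e.g. from a CLIP text encoder).
    image_index: (n_images, d) pre-computed image embeddings (e.g. CLIP image encoder).
    Returns:     (batch, k, d) retrieved image embeddings.
    """
    sims = F.normalize(text_query, dim=-1) @ F.normalize(image_index, dim=-1).T
    top_k = sims.topk(k, dim=-1).indices        # (batch, k) indices of nearest images
    return image_index[top_k]                   # (batch, k, d)


class VisualKnowledgeFusionLayer(nn.Module):
    """Each text position attends jointly over the text context and the
    retrieved image embeddings, with a residual connection and layer norm."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_states: torch.Tensor,
                image_embeds: torch.Tensor) -> torch.Tensor:
        # text_states:  (batch, seq_len, d_model) hidden states from the LM
        # image_embeds: (batch, k, d_model) retrieved image features
        keys_values = torch.cat([text_states, image_embeds], dim=1)
        fused, _ = self.attn(text_states, keys_values, keys_values)
        return self.norm(text_states + fused)


# Toy usage: retrieve 4 images for each of 2 contexts, then fuse them
# into a 16-token text context.
d = 768
index = torch.randn(1000, d)        # stand-in for an image-embedding index
context_embed = torch.randn(2, d)   # stand-in for text-context embeddings
images = retrieve_images(context_embed, index, k=4)
layer = VisualKnowledgeFusionLayer(d_model=d)
out = layer(torch.randn(2, 16, d), images)
print(out.shape)                    # torch.Size([2, 16, 768])
```

In this sketch the retrieved image embeddings are simply appended to the attention keys and values, so positions can draw on visual knowledge only when it is relevant; the paper's language-modeling objective and exact fusion architecture are detailed in the full text.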