Paper Title
Fusion Models for Improved Visual Captioning
Paper Authors
Paper Abstract
Visual captioning aims to generate textual descriptions given images or videos. Traditionally, image captioning models are trained on human annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models while also rendering them liable to making mistakes. Language models can, however, be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders and coherent text generators. Meanwhile, several unimodal and multimodal fusion techniques have been proven to work well for natural language generation and automatic speech recognition. Building on these recent developments, and with the aim of improving the quality of generated captions, the contribution of our work in this paper is two-fold: First, we propose a generic multimodal model fusion framework for caption generation as well as emendation where we utilize different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) within the traditional encoder-decoder visual captioning frameworks. Next, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz. Show, Attend, and Tell, for emending both syntactic and semantic errors in captions. Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MSCOCO, show improvements over the baseline, indicating the usefulness of our proposed multimodal fusion strategies. Further, we perform a preliminary qualitative analysis on the emended captions and identify error categories based on the type of corrections.
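To make the fusion idea concrete, below is a minimal PyTorch sketch of one possible strategy: a gated fusion of a captioning decoder's hidden state with a BERT encoding of the partially generated caption. This is an illustrative assumption rather than the paper's exact architecture; the module names, dimensions, and gating formulation are hypothetical.

```python
# Minimal sketch of gated fusion between a caption decoder state and an
# AuxLM (BERT) state. Assumes PyTorch and HuggingFace transformers;
# all names and dimensions are illustrative, not the authors' implementation.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class GatedFusion(nn.Module):
    """Fuse a captioning decoder state with an AuxLM (BERT) state."""

    def __init__(self, decoder_dim: int, auxlm_dim: int = 768):
        super().__init__()
        # The gate controls how much of the AuxLM signal is admitted.
        self.gate = nn.Linear(decoder_dim + auxlm_dim, auxlm_dim)
        self.proj = nn.Linear(decoder_dim + auxlm_dim, decoder_dim)

    def forward(self, decoder_state: torch.Tensor, auxlm_state: torch.Tensor) -> torch.Tensor:
        # decoder_state: (batch, decoder_dim); auxlm_state: (batch, auxlm_dim)
        g = torch.sigmoid(self.gate(torch.cat([decoder_state, auxlm_state], dim=-1)))
        fused = torch.cat([decoder_state, g * auxlm_state], dim=-1)
        return torch.tanh(self.proj(fused))


if __name__ == "__main__":
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").eval()

    partial_caption = "a man riding a horse on a"
    inputs = tokenizer(partial_caption, return_tensors="pt")
    with torch.no_grad():
        # Use the [CLS] token representation as a sentence-level AuxLM state.
        auxlm_state = bert(**inputs).last_hidden_state[:, 0, :]  # (1, 768)

    decoder_state = torch.randn(1, 512)  # stand-in for an LSTM decoder state
    fusion = GatedFusion(decoder_dim=512)
    fused_state = fusion(decoder_state, auxlm_state)
    print(fused_state.shape)  # torch.Size([1, 512])
```

In a setup like this, the fused state would feed the vocabulary projection in place of the raw decoder state, allowing the pretrained language model to inject linguistic knowledge at each decoding step, whether for initial caption generation or for emending an already generated caption.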