Diverse Image Captioning with Grounded Style

Paper Title

Diverse Image Captioning with Grounded Style

Authors

Franz Klein, Shweta Mahajan, Stefan Roth

Abstract

Stylized image captioning, as presented in prior work, aims to generate captions that reflect characteristics beyond a factual description of the scene composition, such as sentiments. Such prior work relies on given sentiment identifiers, which are used to express a certain global style in the caption (e.g., positive or negative), but it does not take into account the stylistic content of the visual scene. To address this shortcoming, we first analyze the limitations of current stylized captioning datasets and propose COCO attribute-based augmentations to obtain varied stylized captions from COCO annotations. Furthermore, we encode the stylized information in the latent space of a Variational Autoencoder; specifically, we leverage extracted image attributes to explicitly structure its sequential latent space according to different localized style characteristics. Our experiments on the Senticap and COCO datasets show the ability of our approach to generate accurate captions with diversity in styles that are grounded in the image.
