Paper Title
Image Captioning with Visual Object Representations Grounded in the Textual Modality
Paper Authors
Paper Abstract
We present our work in progress exploring the possibilities of a shared embedding space between the textual and visual modalities. Leveraging the textual nature of object detection labels and the hypothetical expressiveness of extracted visual object representations, we propose an approach opposite to the current trend: grounding the representations in the word embedding space of the captioning system instead of grounding words or sentences in their associated images. Building on previous work, we apply additional grounding losses to the image captioning training objective, aiming to force visual object representations to form more heterogeneous clusters based on their class labels and to replicate the semantic structure of the word embedding space. In addition, we provide an analysis of the learned object vector space projection and its impact on the performance of the image captioning (IC) system. With only a slight change in performance, grounded models reach the stopping criterion during training faster than the unconstrained model, needing about two to three times fewer training updates. Additionally, an improvement in structural correlation between the word embeddings and both the original and projected object vectors suggests that the grounding is actually mutual.
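The grounding loss described above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: it pulls a linear projection of each visual object vector toward the word embedding of its detected class label via a mean squared distance, and adds that term to the captioning objective with a weight. All names (`W_proj`, `lambda_ground`) and dimensions are hypothetical, and the projection here is random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 5 class-label word embeddings of dim 8,
# 3 detected visual objects with feature dim 16.
vocab_embeddings = rng.normal(size=(5, 8))   # word embeddings for class labels
object_vectors = rng.normal(size=(3, 16))    # extracted visual object features
class_ids = np.array([0, 2, 4])              # detected class label per object
W_proj = rng.normal(size=(16, 8)) * 0.1      # projection into word space (random here)

def grounding_loss(obj, labels, emb, W):
    """Mean squared distance between projected object vectors
    and the word embeddings of their class labels."""
    projected = obj @ W          # map visual features into word embedding space
    target = emb[labels]         # embedding of each object's detected label
    return float(np.mean((projected - target) ** 2))

# Combined training objective: captioning loss plus weighted grounding term.
lambda_ground = 0.1              # assumed weighting hyperparameter
captioning_loss = 2.5            # placeholder for the cross-entropy captioning loss
total_loss = captioning_loss + lambda_ground * grounding_loss(
    object_vectors, class_ids, vocab_embeddings, W_proj)
```

In an actual system, `W_proj` would be trained jointly with the captioner, so minimizing the grounding term reshapes the object representations into label-based clusters aligned with the word embedding structure.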