Paper Title
Improving Image Captioning with Better Use of Captions
Authors
Abstract
Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision communities. In this paper, we present a novel image captioning architecture to better explore the semantics available in captions and leverage them to enhance both image representation and caption generation. Our model first constructs caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning. The representation is then enhanced with neighbouring and contextual nodes along with their textual and visual features. During generation, the model further incorporates visual relationships using multi-task learning to jointly predict the word sequence and object/predicate tag sequences. We perform extensive experiments on the MSCOCO dataset, showing that the proposed framework significantly outperforms the baselines and achieves state-of-the-art performance under a wide range of evaluation metrics.