Paper Title
Improving Image Captioning with Better Use of Captions
Authors
Abstract
Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision communities. In this paper, we present a novel image captioning architecture to better explore the semantics available in captions and leverage them to enhance both image representation and caption generation. Our model first constructs caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning. The representation is then enhanced with neighbouring and contextual nodes along with their textual and visual features. During generation, the model further incorporates visual relationships using multi-task learning to jointly predict the word sequence and object/predicate tag sequences. We perform extensive experiments on the MSCOCO dataset, showing that the proposed framework significantly outperforms the baselines and achieves state-of-the-art performance under a wide range of evaluation metrics.