Paper Title

Comprehending and Ordering Semantics for Image Captioning

Paper Authors

Yehao Li, Yingwei Pan, Ting Yao, Tao Mei

Paper Abstract

Comprehending the rich semantics in an image and ordering them in linguistic order are essential to compose a visually grounded and linguistically coherent description for image captioning. Modern techniques commonly capitalize on a pre-trained object detector/classifier to mine the semantics in an image, while leaving the inherent linguistic ordering of semantics under-exploited. In this paper, we propose a new Transformer-style recipe, namely the Comprehending and Ordering Semantics Network (COS-Net), that unifies an enriched semantic comprehending process and a learnable semantic ordering process in a single architecture. Technically, we first utilize a cross-modal retrieval model to search for sentences relevant to each image, and all words in the retrieved sentences are taken as primary semantic cues. Next, a novel semantic comprehender is devised to filter out the irrelevant semantic words in the primary semantic cues, and meanwhile infer the missing relevant semantic words visually grounded in the image. After that, we feed all the screened and enriched semantic words into a semantic ranker, which learns to allocate the semantic words in linguistic order, as humans do. Such a sequence of ordered semantic words is further integrated with the visual tokens of the image to trigger sentence generation. Empirical evidence shows that COS-Net clearly surpasses state-of-the-art approaches on COCO and achieves the best-to-date CIDEr score of 141.1% on the Karpathy test split. Source code is available at https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/cosnet.
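To make the pipeline described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the retrieve → comprehend → rank flow. All module names, dimensions, the sigmoid-threshold screening rule, and the toy usage at the bottom are assumptions for illustration only, not the authors' implementation; the actual code lives in the xmodaler repository linked above.

```python
# Hypothetical sketch of a COS-Net-style semantic pipeline (assumed names/shapes).
import torch
import torch.nn as nn


class SemanticComprehender(nn.Module):
    """Scores every vocabulary word against the image so that irrelevant
    retrieved words can be filtered out and missing, visually grounded
    words can be recovered."""

    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.scorer = nn.Linear(dim, vocab_size)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, R, dim) region/grid features -> (B, vocab) logits
        return self.scorer(visual_tokens.mean(dim=1))


class SemanticRanker(nn.Module):
    """Assigns each selected semantic word a position logit, approximating
    the linguistic order in which the words should appear."""

    def __init__(self, dim: int = 512, max_positions: int = 20):
        super().__init__()
        self.position_head = nn.Linear(dim, max_positions)

    def forward(self, word_feats: torch.Tensor) -> torch.Tensor:
        # word_feats: (B, K, dim) -> (B, K, max_positions) position logits
        return self.position_head(word_feats)


def order_semantic_words(visual_tokens, retrieved_word_ids, word_embedding,
                         comprehender, ranker, keep_threshold=0.5, top_k=5):
    """Filter retrieved words, enrich with visually grounded ones, then sort
    them by predicted position. Returns word ids in linguistic order."""
    relevance = comprehender(visual_tokens).sigmoid()          # (B, vocab)

    # 1) Screen: keep retrieved words the image actually supports.
    kept = [w for w in retrieved_word_ids if relevance[0, w] > keep_threshold]

    # 2) Enrich: add top-scoring words missing from the retrieved set.
    extra = relevance[0].topk(top_k).indices.tolist()
    words = list(dict.fromkeys(kept + extra))                  # dedupe, keep order

    # 3) Rank: sort words by their most likely position.
    feats = word_embedding(torch.tensor(words)).unsqueeze(0)   # (1, K, dim)
    positions = ranker(feats).argmax(dim=-1).squeeze(0)        # (K,)
    return [w for _, w in sorted(zip(positions.tolist(), words))]


if __name__ == "__main__":
    vocab, dim = 1000, 512
    emb = nn.Embedding(vocab, dim)
    comp, rank = SemanticComprehender(vocab, dim), SemanticRanker(dim)
    image = torch.randn(1, 49, dim)          # e.g. 7x7 grid features
    retrieved = [12, 57, 300]                # word ids from retrieved sentences
    print(order_semantic_words(image, retrieved, emb, comp, rank))
```

In the full model, the ordered word sequence produced by this kind of ranker would be fed, together with the visual tokens, into the caption decoder that generates the sentence.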
