Paper title
More Grounded Image Captioning by Distilling Image-Text Matching Model
Authors
Abstract
Visual attention not only improves the performance of image captioners, but also serves as a visual interpretation for qualitatively measuring caption rationality and model transparency. Specifically, we expect a captioner to fix its attentive gaze on the correct objects while generating the corresponding words. This ability is also known as grounded image captioning. However, the grounding accuracy of existing captioners is far from satisfactory. Improving grounding accuracy while retaining captioning quality by collecting word-region alignments as strong supervision is expensive. To this end, we propose POS-SCAN, a Part-of-Speech (POS) enhanced version of the image-text matching model SCAN \cite{lee2018stacked}, as an effective source of knowledge distillation for more grounded image captioning. The benefits are two-fold: 1) given a sentence and an image, POS-SCAN grounds the objects more accurately than SCAN; 2) POS-SCAN serves as a word-region alignment regularizer for the captioner's visual attention module. Benchmark experiments demonstrate that conventional image captioners equipped with POS-SCAN can significantly improve grounding accuracy without strong supervision. Last but not least, we explore the indispensable Self-Critical Sequence Training (SCST) \cite{Rennie_2017_CVPR} in the context of grounded image captioning and show that the image-text matching score can serve as a reward for more grounded captioning.\footnote{https://github.com/YuanEZhou/Grounded-Image-Captioning}
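
To make the idea of a word-region alignment regularizer concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the form of the regularizer. The sketch assumes a KL divergence between the matching model's word-to-region attention (teacher) and the captioner's attention (student), applied only at noun positions identified by a POS tagger. The function name `attention_distillation_loss`, the array shapes, and the noun mask are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_distillation_loss(captioner_logits, matcher_logits, noun_mask, eps=1e-12):
    """Mean KL(teacher || student) over noun words (assumed form of the regularizer).

    captioner_logits: (T, R) unnormalized attention of the captioner over R regions
                      at each of T generated words (student).
    matcher_logits:   (T, R) unnormalized word-region scores from the image-text
                      matching model (teacher).
    noun_mask:        (T,) boolean array, True where the word is tagged as a noun.
    """
    student = softmax(captioner_logits)
    teacher = softmax(matcher_logits)
    # Per-word KL divergence between the teacher and student attention distributions.
    kl = (teacher * (np.log(teacher + eps) - np.log(student + eps))).sum(axis=-1)
    mask = noun_mask.astype(float)
    n_nouns = mask.sum()
    if n_nouns == 0:
        # No noun in the caption: nothing to align, so no penalty.
        return 0.0
    return float((kl * mask).sum() / n_nouns)
```

In a training loop this term would be added, with some weight, to the usual captioning loss. The weight and the choice to restrict the penalty to nouns are assumptions made for illustration.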