Paper Title

CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning

Authors

Xu, Shitong

Abstract


The image captioning task has been extensively researched by previous work. However, few experiments have focused on generating captions with a non-autoregressive text decoder. Inspired by the recent success of denoising diffusion models on image synthesis tasks, we apply denoising diffusion probabilistic models to text generation in the image captioning task. We show that our CLIP-Diffusion-LM is capable of generating image captions using significantly fewer inference steps than autoregressive models. On the Flickr8k dataset, the model achieves a 0.1876 BLEU-4 score. By training on the combined Flickr8k and Flickr30k dataset, our model achieves a 0.2470 BLEU-4 score. Our code is available at https://github.com/xu-shitong/diffusion-image-captioning.
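To make the abstract's approach concrete, here is a minimal, hedged sketch of what a DDPM-style reverse (denoising) process over caption token embeddings could look like. This is not the paper's implementation: the `denoiser` callable, the embedding dimensions, and the linear beta schedule are all illustrative assumptions; in the actual model the denoiser would be a transformer conditioned on CLIP image features, and the final embeddings would be rounded to the nearest token embeddings.

```python
import numpy as np

def ddpm_caption_sample(denoiser, image_feat, seq_len=12, dim=8, steps=50, seed=0):
    """Toy DDPM reverse process over a (seq_len, dim) block of caption embeddings.

    `denoiser(x_t, t, image_feat)` is assumed to predict the clean embeddings
    x0 from the noisy sample x_t, the timestep t, and the image conditioning.
    """
    rng = np.random.default_rng(seed)
    # Linear beta schedule as in Ho et al. (2020); values are illustrative.
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal((seq_len, dim))  # start from pure noise x_T
    for t in reversed(range(steps)):
        x0_hat = denoiser(x, t, image_feat)  # predicted clean embeddings
        ab, a = alpha_bars[t], alphas[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        # Posterior mean of q(x_{t-1} | x_t, x0) from the DDPM derivation.
        coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab)
        coef_xt = np.sqrt(a) * (1.0 - ab_prev) / (1.0 - ab)
        mean = coef_x0 * x0_hat + coef_xt * x
        if t > 0:
            var = betas[t] * (1.0 - ab_prev) / (1.0 - ab)
            x = mean + np.sqrt(var) * rng.standard_normal(x.shape)
        else:
            x = mean  # last step is deterministic
    return x  # denoised embeddings, to be mapped back to discrete tokens

# Toy stand-in denoiser: ignores the image and always predicts a fixed target,
# playing the role of a trained, CLIP-conditioned transformer.
target = np.ones((12, 8))
caption_emb = ddpm_caption_sample(lambda x, t, img: target, image_feat=None)
```

Note the contrast with autoregressive decoding: here the whole caption is refined jointly over a fixed number of denoising steps, rather than one token per forward pass, which is the source of the "fewer inference steps" claim.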
