Paper Title


Learning to Model Multimodal Semantic Alignment for Story Visualization

Authors

Bowen Li, Thomas Lukasiewicz

Abstract


Story visualization aims to generate a sequence of images narrating each sentence of a multi-sentence story, where the images should be realistic and maintain global consistency across dynamic scenes and characters. Current works suffer from semantic misalignment because of their fixed architectures and the diversity of input modalities. To address this problem, we explore the semantic alignment between text and image representations by learning to match their semantic levels in a GAN-based generative model. More specifically, we introduce dynamic interactions that learn to explore various semantic depths and fuse information from the different modalities at a matched semantic level, which alleviates the text-image semantic misalignment problem. Extensive experiments on different datasets demonstrate that our approach, using neither segmentation masks nor auxiliary captioning networks, improves image quality and story consistency compared with state-of-the-art methods.
