Paper Title
Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space
Paper Authors
Paper Abstract
Both images and music can convey rich semantics and are widely used to induce specific emotions. Matching images and music that carry similar emotions may help make emotion perception more vivid and intense. Existing emotion-based image-music matching methods either employ a limited set of categorical emotion states, which cannot adequately reflect the complexity and subtlety of emotions, or train the matching model with an impractical multi-stage pipeline. In this paper, we study end-to-end matching between images and music based on emotions in the continuous valence-arousal (VA) space. First, we construct a large-scale dataset, termed Image-Music-Emotion-Matching-Net (IMEMNet), with over 140K image-music pairs. Second, we propose cross-modal deep continuous metric learning (CDCML) to learn a shared latent embedding space that preserves the cross-modal similarity relationship in the continuous matching space. Finally, we refine the embedding space by further preserving the single-modal emotion relationship in the VA spaces of both images and music. Metric learning in the embedding space and task regression in the label space are jointly optimized for both cross-modal matching and single-modal VA prediction. Extensive experiments on IMEMNet demonstrate the superiority of CDCML for emotion-based image and music matching compared with state-of-the-art approaches.
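The abstract describes CDCML only at a high level: two modality encoders project into a shared embedding space trained with a continuous cross-modal metric loss, jointly optimized with VA regression in the label space. The following is a minimal PyTorch sketch of that joint objective. All module names, feature dimensions, and the concrete loss forms (cosine distance fit to a continuous match score, MSE for VA regression) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDCMLSketch(nn.Module):
    """Hypothetical sketch: encoders map image/music features into a shared
    embedding space; small linear heads regress valence-arousal (VA) values.
    Dimensions are placeholders, not the paper's architecture."""
    def __init__(self, img_dim=2048, mus_dim=128, emb_dim=256):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Linear(img_dim, emb_dim), nn.ReLU(),
                                     nn.Linear(emb_dim, emb_dim))
        self.mus_enc = nn.Sequential(nn.Linear(mus_dim, emb_dim), nn.ReLU(),
                                     nn.Linear(emb_dim, emb_dim))
        self.img_va = nn.Linear(emb_dim, 2)  # predicts (valence, arousal)
        self.mus_va = nn.Linear(emb_dim, 2)

    def forward(self, img_feat, mus_feat):
        z_img = F.normalize(self.img_enc(img_feat), dim=-1)
        z_mus = F.normalize(self.mus_enc(mus_feat), dim=-1)
        return z_img, z_mus, self.img_va(z_img), self.mus_va(z_mus)

def joint_loss(z_img, z_mus, va_img_pred, va_mus_pred,
               va_img_true, va_mus_true, match_score, alpha=1.0):
    """Continuous metric loss: the embedding distance of an image-music pair
    should track its emotion-based match score, while VA regression in the
    label space is optimized jointly (loss forms assumed, not from the paper)."""
    dist = 1.0 - (z_img * z_mus).sum(dim=-1)       # cosine distance in [0, 2]
    metric = F.mse_loss(dist, 1.0 - match_score)   # continuous, not binary, target
    regress = (F.mse_loss(va_img_pred, va_img_true) +
               F.mse_loss(va_mus_pred, va_mus_true))
    return metric + alpha * regress

# Toy usage with random features and continuous match scores in [0, 1]:
model = CDCMLSketch()
img, mus = torch.randn(8, 2048), torch.randn(8, 128)
z_i, z_m, va_i, va_m = model(img, mus)
loss = joint_loss(z_i, z_m, va_i, va_m,
                  torch.rand(8, 2), torch.rand(8, 2), torch.rand(8))
loss.backward()
```

The key design point the abstract emphasizes is that the matching target is continuous (derived from VA-space emotion similarity) rather than a binary matched/unmatched label, which is why the sketch regresses the embedding distance onto a real-valued score instead of using a contrastive or triplet loss.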