Vision+X: A Survey on Multimodal Learning in the Light of Data

Paper Title

Vision+X: A Survey on Multimodal Learning in the Light of Data

Authors

Zhu, Ye; Wu, Yu; Sebe, Nicu; Yan, Yan

Abstract

We are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified sensing system. To endow the machines with true intelligence, multimodal machine learning that incorporates data from various sources has become an increasingly popular research area with emerging technical advances in recent years. In this paper, we present a survey on multimodal machine learning from a novel perspective considering not only the purely technical aspects but also the intrinsic nature of different data modalities. We analyze the commonness and uniqueness of each data format mainly ranging from vision, audio, text, and motions, and then present the methodological advancements categorized by the combination of data modalities, such as Vision+Text, with slightly inclined emphasis on the visual data. We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels, and provide an additional comparison in the light of their technical connections with the data nature, e.g., the semantic consistency between image objects and textual descriptions, and the rhythm correspondence between video dance moves and musical beats. We hope that the exploitation of the alignment as well as the existing gap between the intrinsic nature of data modality and the technical designs, will benefit future research studies to better address a specific challenge related to the concrete multimodal task, prompting a unified multimodal machine learning framework closer to a real human intelligence system.
