Paper Title
Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment
Paper Authors
Paper Abstract
The natural world is abundant with concepts expressed via visual, acoustic, tactile, and linguistic modalities. Much of the existing progress in multimodal learning, however, focuses primarily on problems where the same set of modalities is present at train and test time, which makes learning in low-resource modalities particularly difficult. In this work, we propose algorithms for cross-modal generalization: a learning paradigm to train a model that can (1) quickly perform new tasks in a target modality (i.e. meta-learning) and (2) do so while being trained on a different source modality. We study a key research question: how can we ensure generalization across modalities despite using separate encoders for different source and target modalities? Our solution is based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data while ensuring quick generalization to new tasks across different modalities. We study this problem on three classification tasks: text to image, image to audio, and text to speech. Our results demonstrate strong performance even when the new target modality has only a few (1-10) labeled samples and in the presence of noisy labels, a scenario particularly prevalent in low-resource modalities.
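To make the idea of meta-alignment more concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of the two ingredients the abstract describes: separate encoders for the source and target modalities, aligned on paired cross-modal data, followed by few-shot adaptation in the target modality with only a handful of labeled samples. The PyTorch framing, the module names, the contrastive alignment loss, and the linear-probe adaptation step are all assumptions made for illustration; the paper's actual meta-learning procedure may differ.

```python
# Illustrative sketch only. All names, dimensions, and losses below are
# assumptions for exposition, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Maps features from one modality (e.g. text or image) into a shared space."""

    def __init__(self, in_dim, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, x):
        # Unit-normalized embeddings so that dot products act as cosine similarity.
        return F.normalize(self.net(x), dim=-1)


def alignment_loss(src_emb, tgt_emb, temperature=0.1):
    """Symmetric InfoNCE over strongly paired source/target embeddings."""
    logits = src_emb @ tgt_emb.t() / temperature
    labels = torch.arange(src_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def few_shot_adapt(tgt_encoder, support_x, support_y, num_classes, steps=100, lr=1e-2):
    """Fit a linear head on frozen, aligned target embeddings from 1-10 labeled samples."""
    head = nn.Linear(tgt_encoder.net[-1].out_features, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    with torch.no_grad():
        emb = tgt_encoder(support_x)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(head(emb), support_y).backward()
        opt.step()
    return head


if __name__ == "__main__":
    torch.manual_seed(0)
    # Separate encoders for the source (e.g. 300-d text features) and target
    # (e.g. 512-d image features) modalities; dimensions are placeholders.
    src_enc, tgt_enc = ModalityEncoder(in_dim=300), ModalityEncoder(in_dim=512)
    src_x, tgt_x = torch.randn(32, 300), torch.randn(32, 512)  # paired cross-modal data
    opt = torch.optim.Adam(list(src_enc.parameters()) + list(tgt_enc.parameters()), lr=1e-3)
    for _ in range(10):  # alignment training on cross-modal pairs
        opt.zero_grad()
        alignment_loss(src_enc(src_x), tgt_enc(tgt_x)).backward()
        opt.step()
    # Few-shot classification in the target modality: 5 labeled samples, 5 classes.
    head = few_shot_adapt(tgt_enc, torch.randn(5, 512), torch.arange(5), num_classes=5)
```

This sketch uses a symmetric contrastive loss as a stand-in for alignment on strongly paired data and a simple linear probe as a stand-in for fast adaptation; the weakly paired setting and the meta-learning outer loop described in the abstract are omitted for brevity.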