Paper Title
Direct multimodal few-shot learning of speech and images
Paper Authors
Paper Abstract
We propose direct multimodal few-shot models that learn a shared embedding space of spoken words and images from only a few paired examples. Imagine an agent is shown an image along with a spoken word describing the object in the picture, e.g. pen, book and eraser. After observing a few paired examples of each class, the model is asked to identify the "book" in a set of unseen pictures. Previous work used a two-step indirect approach relying on learned unimodal representations: speech-speech and image-image comparisons are performed across the support set of given speech-image pairs. We propose two direct models which instead learn a single multimodal space where inputs from different modalities are directly comparable: a multimodal triplet network (MTriplet) and a multimodal correspondence autoencoder (MCAE). To train these direct models, we mine speech-image pairs: the support set is used to pair up unlabelled in-domain speech and images. In a speech-to-image digit matching task, direct models outperform indirect models, with the MTriplet achieving the best multimodal five-shot accuracy. We show that the improvements are due to the combination of unsupervised and transfer learning in the direct models, and the absence of two-step compounding errors.
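To make the idea of a directly comparable multimodal space concrete, below is a minimal PyTorch sketch of a multimodal triplet objective in the spirit of the MTriplet model: a speech encoder and an image encoder map both modalities into one shared embedding space, and mined speech-image pairs supply the anchor and positive while a shuffled image serves as the negative. The encoder architectures, feature dimensions, and negative-sampling scheme are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a multimodal triplet objective (MTriplet-style).
# Architectures, dimensions, and negative sampling are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    """Maps a spoken-word feature sequence (batch, frames, n_feats) to a unit embedding."""
    def __init__(self, n_feats=39, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_feats, dim, batch_first=True)

    def forward(self, x):
        _, h = self.rnn(x)                 # final hidden state summarises the word
        return F.normalize(h[-1], dim=-1)

class ImageEncoder(nn.Module):
    """Maps an image (batch, 1, 28, 28) to a unit embedding in the same space."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def multimodal_triplet_loss(speech_emb, image_emb, margin=0.2):
    """Speech anchor, its mined image as positive, a randomly permuted image as negative."""
    neg = image_emb[torch.randperm(image_emb.size(0))]
    return F.triplet_margin_loss(speech_emb, image_emb, neg, margin=margin)

# Usage on a batch of mined speech-image pairs (shapes are illustrative):
speech = torch.randn(16, 100, 39)      # 16 spoken words, 100 frames, 39 acoustic features
images = torch.randn(16, 1, 28, 28)    # the 16 images they were paired with
loss = multimodal_triplet_loss(SpeechEncoder()(speech), ImageEncoder()(images))
```

Because both encoders project into one space, a query spoken word can be matched to candidate images by direct cosine similarity, avoiding the two-step speech-speech and image-image comparisons of the indirect approach.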