Paper Title
Unsupervised vs. transfer learning for multimodal one-shot matching of speech and images
Paper Authors
Paper Abstract
We consider the task of multimodal one-shot speech-image matching. An agent is shown a picture along with a spoken word describing the object in the picture, e.g. cookie, broccoli and ice-cream. After observing one paired speech-image example per class, it is shown a new set of unseen pictures, and asked to pick the "ice-cream". Previous work attempted to tackle this problem using transfer learning: supervised models are trained on labelled background data not containing any of the one-shot classes. Here we compare transfer learning to unsupervised models trained on unlabelled in-domain data. On a dataset of paired isolated spoken and visual digits, we specifically compare unsupervised autoencoder-like models to supervised classifier and Siamese neural networks. In both unimodal and multimodal few-shot matching experiments, we find that transfer learning outperforms unsupervised training. We also present experiments towards combining the two methodologies, but find that transfer learning still performs best (despite idealised experiments showing the benefits of unsupervised learning).
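As a rough illustration of the matching setup described in the abstract, the sketch below performs one-shot speech-image matching by nearest-neighbour comparison of embeddings under cosine similarity. This is not the paper's implementation: the encoder functions, file names, and digit examples are hypothetical placeholders standing in for features from a trained classifier, Siamese network, or autoencoder-like model.

```python
# Minimal sketch (assumed setup, not the paper's code) of one-shot
# speech-image matching via nearest-neighbour cosine comparison.
import numpy as np

EMB_DIM = 64

def _placeholder_embedding(item):
    # Deterministic pseudo-random vector keyed on the input; a stand-in for a
    # trained speech or image encoder. With real encoders, matching classes
    # would map close together in this space.
    seed = abs(hash(item)) % (2 ** 32)
    return np.random.default_rng(seed).standard_normal(EMB_DIM)

def embed_speech(utterance):
    return _placeholder_embedding(("speech", utterance))

def embed_image(image):
    return _placeholder_embedding(("image", image))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def one_shot_match(query_speech, support_pairs, test_images):
    """Pick the unseen test image that matches a spoken query.

    support_pairs: one (speech, image) example per class (the one-shot set).
    Strategy: find the support speech example closest to the spoken query,
    then return the test image closest to that support pair's image.
    """
    q = embed_speech(query_speech)
    best_pair = max(support_pairs,
                    key=lambda pair: cosine(q, embed_speech(pair[0])))
    ref = embed_image(best_pair[1])
    return max(test_images, key=lambda img: cosine(ref, embed_image(img)))

if __name__ == "__main__":
    # Toy run with string stand-ins for audio clips and pictures; with
    # placeholder encoders the chosen image is arbitrary.
    support = [("spoken_three.wav", "support_three.png"),
               ("spoken_seven.wav", "support_seven.png")]
    unseen = ["test_three.png", "test_seven.png"]
    print(one_shot_match("query_three.wav", support, unseen))
```

In this framing, the transfer-learning and unsupervised approaches compared in the paper differ only in how the speech and image encoders are obtained; the matching step itself is a simple nearest-neighbour comparison as sketched above.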