Title
Learning English with Peppa Pig
Abstract
Recent computational models of the acquisition of spoken language via grounding in perception exploit associations between the spoken and visual modalities and learn to represent speech and visual data in a joint vector space. A major unresolved issue from the point of view of ecological validity is the training data, which typically consist of images or videos paired with spoken descriptions of what is depicted. Such a setup guarantees an unrealistically strong correlation between speech and the visual data. In the real world, the coupling between the linguistic and the visual modality is loose, and often confounded by correlations with non-semantic aspects of the speech signal. Here we address this shortcoming by using a dataset based on the children's cartoon Peppa Pig. We train a simple bi-modal architecture on the portion of the data consisting of dialog between characters, and evaluate on segments containing descriptive narrations. Despite the weak and confounded signal in this training data, our model succeeds at learning aspects of the visual semantics of spoken language.
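The abstract does not spell out the training objective, so the sketch below shows one common way such a joint speech-video space is learned: two encoders project each modality into a shared vector space, and a symmetric contrastive (InfoNCE-style) loss pulls matched speech/video pairs together while pushing mismatched within-batch pairs apart. This is a minimal illustration under those assumptions, not the paper's actual architecture; the names (BiModalEncoder, contrastive_loss), encoder choices, and dimensions are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiModalEncoder(nn.Module):
    """Hypothetical sketch: embeds speech and video clips into a joint space."""
    def __init__(self, audio_dim=40, video_dim=512, joint_dim=256):
        super().__init__()
        # Speech branch: frame-level features pooled into one utterance vector.
        self.audio_rnn = nn.GRU(audio_dim, joint_dim, batch_first=True)
        # Video branch: pre-extracted clip features projected into the joint space.
        self.video_proj = nn.Linear(video_dim, joint_dim)

    def forward(self, audio, video):
        # audio: (batch, frames, audio_dim); video: (batch, video_dim)
        _, h = self.audio_rnn(audio)
        a = F.normalize(h[-1], dim=-1)                  # unit-norm speech embedding
        v = F.normalize(self.video_proj(video), dim=-1)  # unit-norm video embedding
        return a, v

def contrastive_loss(a, v, temperature=0.1):
    """Symmetric InfoNCE: each clip's matched pair must outscore
    all mismatched pairs within the batch, in both directions."""
    logits = a @ v.t() / temperature                    # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = matched pairs
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Usage on random stand-in data (shapes are illustrative):
model = BiModalEncoder()
audio = torch.randn(8, 100, 40)   # e.g. 8 clips of 100 acoustic feature frames
video = torch.randn(8, 512)       # e.g. one pooled video feature vector per clip
loss = contrastive_loss(*model(audio, video))
loss.backward()
```

Note that this batch-wise contrastive setup relies only on which speech segment co-occurs with which video clip, with no transcriptions or labels, which is what makes it compatible with the loose, naturally confounded pairing the abstract describes.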