Title
Towards visually prompted keyword localisation for zero-resource spoken languages
Authors
Abstract
Imagine being able to show a system a visual depiction of a keyword and finding spoken utterances that contain this keyword from a zero-resource speech corpus. We formalise this task and call it visually prompted keyword localisation (VPKL): given an image of a keyword, detect and predict where in an utterance the keyword occurs. To do VPKL, we propose a speech-vision model with a novel localising attention mechanism, which we train with a new keyword sampling scheme. We show that these innovations give improvements in VPKL over an existing speech-vision model. We also compare to a visual bag-of-words (BoW) model where images are automatically tagged with visual labels and paired with unlabelled speech. Although this visual BoW can be queried directly with a written keyword (while ours takes image queries), our new model still outperforms the visual BoW in both detection and localisation, giving a 16% relative improvement in localisation F1.