Paper Title

Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting

Paper Authors

Hyeon-Kyeong Shin, Hyewon Han, Doyeon Kim, Soo-Whan Chung, Hong-Goo Kang

Paper Abstract

In this paper, we propose a novel end-to-end user-defined keyword spotting method that utilizes linguistically corresponding patterns between speech and text sequences. Unlike previous approaches requiring speech keyword enrollment, our method compares input queries with an enrolled text keyword sequence. To place the audio and text representations within a common latent space, we adopt an attention-based cross-modal matching approach that is trained in an end-to-end manner with monotonic matching loss and keyword classification loss. We also utilize a de-noising loss for the acoustic embedding network to improve robustness in noisy environments. Additionally, we introduce the LibriPhrase dataset, a new short-phrase dataset based on LibriSpeech for efficiently training keyword spotting models. Our proposed method achieves competitive results on various evaluation sets compared to other single-modal and cross-modal baselines.
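
To make the abstract's core idea concrete, here is a minimal, self-contained sketch of attention-based cross-modal matching between acoustic and text embeddings, trained with a matching loss and a keyword classification loss. This is not the paper's implementation: the class name, projection layers, embedding dimensions, the soft-diagonal prior standing in for the monotonic matching loss, and the use of binary cross-entropy for keyword classification are all assumptions for illustration only.

```python
# Minimal sketch (NOT the paper's exact architecture or losses): text tokens
# attend over audio frames in a shared latent space; a soft-diagonal prior
# stands in for the monotonic matching loss, and BCE stands in for the
# keyword classification loss. All sizes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalMatcher(nn.Module):
    def __init__(self, audio_dim=256, text_dim=256, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)  # acoustic embedding -> shared space
        self.text_proj = nn.Linear(text_dim, hidden)    # text embedding -> shared space
        self.classifier = nn.Linear(hidden, 1)          # keyword / non-keyword score

    def forward(self, audio_emb, text_emb):
        # audio_emb: (B, T_a, audio_dim), text_emb: (B, T_t, text_dim)
        q = self.text_proj(text_emb)                    # queries from text tokens
        k = self.audio_proj(audio_emb)                  # keys/values from audio frames
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)  # (B, T_t, T_a)
        aligned = attn @ k                              # audio summarized per text token
        logit = self.classifier(aligned.mean(dim=1)).squeeze(-1)  # utterance-level score
        return attn, logit

def monotonic_matching_loss(attn):
    # Hypothetical surrogate: pull the text-to-audio attention map toward a
    # soft monotonic diagonal alignment (text position ~ audio position).
    B, T_t, T_a = attn.shape
    t = torch.linspace(0, 1, T_t, device=attn.device).view(1, T_t, 1)
    a = torch.linspace(0, 1, T_a, device=attn.device).view(1, 1, T_a)
    target = torch.exp(-((t - a) ** 2) / 0.01)          # soft diagonal prior
    target = (target / target.sum(dim=-1, keepdim=True)).expand_as(attn)
    return F.kl_div(attn.clamp_min(1e-8).log(), target, reduction="batchmean")

# Example usage with random tensors standing in for real embeddings.
model = CrossModalMatcher()
audio = torch.randn(4, 120, 256)                        # 4 utterances, 120 frames each
text = torch.randn(4, 12, 256)                          # 12 text tokens per keyword
labels = torch.randint(0, 2, (4,)).float()              # keyword match or not
attn, logit = model(audio, text)
loss = F.binary_cross_entropy_with_logits(logit, labels) + monotonic_matching_loss(attn)
loss.backward()
```

The sketch omits the paper's de-noising loss on the acoustic embedding network and any encoder details; it only illustrates how a text query can score an audio input without speech enrollment, which is the comparison the abstract describes.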
