UniKW-AT: Unified Keyword Spotting and Audio Tagging

Paper Title

UniKW-AT: Unified Keyword Spotting and Audio Tagging

Authors

Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang

Abstract

Within the audio research community and the industry, keyword spotting (KWS) and audio tagging (AT) are seen as two distinct tasks and research fields. However, from a technical point of view, both of these tasks are identical: they predict a label (keyword in KWS, sound event in AT) for some fixed-sized input audio segment. This work proposes UniKW-AT: An initial approach for jointly training both KWS and AT. UniKW-AT enhances the noise-robustness for KWS, while also being able to predict specific sound events and enabling conditional wake-ups on sound events. Our approach extends the AT pipeline with additional labels describing the presence of a keyword. Experiments are conducted on the Google Speech Commands V1 (GSCV1) and the balanced Audioset (AS) datasets. The proposed MobileNetV2 model achieves an accuracy of 97.53% on the GSCV1 dataset and an mAP of 33.4 on the AS evaluation set. Further, we show that significant noise-robustness gains can be observed on a real-world KWS dataset, greatly outperforming standard KWS approaches. Our study shows that KWS and AT can be merged into a single framework without significant performance degradation.
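
The abstract's central idea, extending the AT label set with extra labels that mark the presence of a keyword, can be illustrated with a minimal sketch. This is not the authors' implementation: the three sound-event names, the two keywords, the `kw:` prefix, and both helper functions are hypothetical and chosen only for illustration. The sketch assumes multi-hot targets, so a single clip may carry several active labels.

```python
from typing import List, Sequence


def build_unified_labels(at_labels: Sequence[str], keywords: Sequence[str]) -> List[str]:
    """Append one label per keyword to an existing audio tagging label set."""
    # The "kw:" prefix is a hypothetical naming choice to avoid name clashes.
    return list(at_labels) + [f"kw:{word}" for word in keywords]


def encode_target(active: Sequence[str], label_set: Sequence[str]) -> List[float]:
    """Build a multi-hot target vector over the unified label set."""
    index = {name: position for position, name in enumerate(label_set)}
    target = [0.0] * len(label_set)
    for name in active:
        target[index[name]] = 1.0
    return target


# Hypothetical toy label sets; the real datasets are far larger.
AT_LABELS = ["Speech", "Dog", "Music"]
KEYWORDS = ["yes", "no"]
UNIFIED = build_unified_labels(AT_LABELS, KEYWORDS)

# A sound-event clip activates only sound-event labels.
at_clip = encode_target(["Dog", "Music"], UNIFIED)

# A keyword clip activates its keyword label (assumption: no other labels).
kws_clip = encode_target(["kw:yes"], UNIFIED)
```

With one output per entry in `UNIFIED`, a single classifier can be trained on both kinds of clips, and a keyword detection can be read off the `kw:` positions while the remaining positions report sound events.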
