Paper title
UniKW-AT: Unified Keyword Spotting and Audio Tagging
Authors
Abstract
Within the audio research community and the industry, keyword spotting (KWS) and audio tagging (AT) are seen as two distinct tasks and research fields. However, from a technical point of view, both of these tasks are identical: they predict a label (keyword in KWS, sound event in AT) for some fixed-sized input audio segment. This work proposes UniKW-AT: An initial approach for jointly training both KWS and AT. UniKW-AT enhances the noise-robustness for KWS, while also being able to predict specific sound events and enabling conditional wake-ups on sound events. Our approach extends the AT pipeline with additional labels describing the presence of a keyword. Experiments are conducted on the Google Speech Commands V1 (GSCV1) and the balanced Audioset (AS) datasets. The proposed MobileNetV2 model achieves an accuracy of 97.53% on the GSCV1 dataset and an mAP of 33.4 on the AS evaluation set. Further, we show that significant noise-robustness gains can be observed on a real-world KWS dataset, greatly outperforming standard KWS approaches. Our study shows that KWS and AT can be merged into a single framework without significant performance degradation.
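The idea of extending the AT label set with keyword labels can be illustrated with a minimal sketch. The label names, the size of the label inventory, the 0.5 threshold, and the helper functions `encode_targets`, `keyword_decision`, and `conditional_wakeup` are all hypothetical and chosen for illustration; the abstract does not specify how targets are encoded or how decisions are made. Whether a keyword clip also carries a sound-event label such as "Speech" is likewise an assumption.

```python
# Hypothetical label inventory: names and sizes are illustrative assumptions,
# not taken from the paper.
AT_LABELS = ["Speech", "Dog", "Music"]      # stand-in for the AudioSet sound-event classes
KWS_LABELS = ["yes", "no", "up", "down"]    # stand-in for the GSCV1 keywords

# The unified label space is the AT label set extended by the keyword labels.
UNIFIED_LABELS = AT_LABELS + KWS_LABELS
LABEL_TO_INDEX = {name: i for i, name in enumerate(UNIFIED_LABELS)}


def encode_targets(active_labels):
    """Return a multi-hot target vector over the unified label space."""
    target = [0.0] * len(UNIFIED_LABELS)
    for name in active_labels:
        target[LABEL_TO_INDEX[name]] = 1.0
    return target


def keyword_decision(scores, threshold=0.5):
    """Return the highest-scoring keyword, or None if its score is below the threshold."""
    kw_scores = scores[len(AT_LABELS):]
    best = max(range(len(kw_scores)), key=kw_scores.__getitem__)
    return KWS_LABELS[best] if kw_scores[best] >= threshold else None


def conditional_wakeup(scores, event, threshold=0.5):
    """Return True if the score of the given sound event reaches the threshold."""
    return scores[LABEL_TO_INDEX[event]] >= threshold


# A keyword clip (assumed here to also be tagged as speech) and a model output.
target = encode_targets(["Speech", "yes"])
scores = [0.9, 0.1, 0.0, 0.2, 0.8, 0.1, 0.0]
print(target)
print(keyword_decision(scores), conditional_wakeup(scores, "Dog"))
```

With a single output vector, the same model output serves both tasks: the keyword slice drives the KWS decision, while any sound-event entry can gate a wake-up.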