An empirical study of weakly supervised audio tagging embeddings for general audio representations

Paper title

An empirical study of weakly supervised audio tagging embeddings for general audio representations

Authors

Dinkel, Heinrich; Yan, Zhiyong; Wang, Yongqing; Zhang, Junbo; Wang, Yujun

Abstract

We study the usability of pre-trained weakly supervised audio tagging (AT) models as feature extractors for general audio representations. We mainly analyze the feasibility of transferring those embeddings to other tasks within the speech and sound domains. Specifically, we benchmark weakly supervised pre-trained models (MobileNetV2 and EfficientNet-B0) against modern self-supervised learning methods (BYOL-A) as feature extractors. Fourteen downstream tasks are used for evaluation ranging from music instrument classification to language classification. Our results indicate that AT pre-trained models are an excellent transfer learning choice for music, event, and emotion recognition tasks. Further, finetuning AT models can also benefit speech-related tasks such as keyword spotting and intent classification.
