A Squeeze-and-Excitation and Transformer based Cross-task System for Environmental Sound Recognition

Paper title

A Squeeze-and-Excitation and Transformer based Cross-task System for Environmental Sound Recognition

Authors

Jisheng Bai, Jianfeng Chen, Mou Wang, Muhammad Saad Ayub

Abstract

Environmental sound recognition (ESR) is an emerging research topic in audio pattern recognition. Many tasks have been proposed that call for computational ESR models in real-life applications. However, current models are usually designed for individual tasks and are neither robust nor readily applicable to other tasks. Cross-task models, which promote unified knowledge modeling across various tasks, have not been thoroughly investigated. In this article, we propose a cross-task model for three different ESR tasks: 1) acoustic scene classification; 2) urban sound tagging; and 3) anomalous sound detection. We present an architecture named SE-Trans that uses attention-based Squeeze-and-Excitation and Transformer encoder modules to learn the channel-wise relationships and temporal dependencies of the acoustic features. FMix is employed as a data augmentation method to improve ESR performance. The three tasks are evaluated on recent databases from the Detection and Classification of Acoustic Scenes and Events (DCASE) challenges. The experimental results show that the proposed cross-task model achieves state-of-the-art performance on all tasks. Further analysis demonstrates that the proposed cross-task model can effectively utilize acoustic knowledge across different ESR tasks.
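The abstract names two generic building blocks: a Squeeze-and-Excitation (SE) channel-attention block and FMix data augmentation. The NumPy sketch below illustrates how each works in general; it is not the SE-Trans implementation, and the Transformer encoder is omitted. The feature layout (channels, time, frequency), the bottleneck weights `w1`/`w2`, the reduction ratio `r`, the FMix `decay` and `alpha` values, and the helper names `se_block`, `fmix_mask`, and `fmix` are hypothetical choices made only for illustration.

```python
import numpy as np


def se_block(x, w1, w2):
    """Squeeze-and-Excitation over the channel axis of a (C, T, F) feature map.

    w1 (C // r, C) and w2 (C, C // r) stand in for learned weights (hypothetical).
    """
    # Squeeze: global average pooling over time and frequency gives one value per channel.
    z = x.mean(axis=(1, 2))
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate in (0, 1) per channel.
    s = np.maximum(w1 @ z, 0.0)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ s)))
    # Scale: reweight every time-frequency bin of each channel by that channel's gate.
    return x * gates[:, None, None]


def fmix_mask(shape, lam, decay=3.0, rng=None):
    """Binary FMix-style mask: the top `lam` fraction of low-pass-filtered noise is set to 1."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = shape
    # Radial frequency of every bin in an rfft2 layout of size (h, w // 2 + 1).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.rfftfreq(w)[None, :]
    freq = np.sqrt(fy ** 2 + fx ** 2)
    # Random complex spectrum attenuated by 1 / f**decay; the DC bin is clamped to avoid 1/0.
    scale = 1.0 / np.maximum(freq, 1.0 / max(h, w)) ** decay
    noise = rng.standard_normal(freq.shape) + 1j * rng.standard_normal(freq.shape)
    grey = np.fft.irfft2(noise * scale, s=(h, w))
    # Set the round(lam * h * w) highest-valued pixels to 1 and the rest to 0.
    n_ones = int(round(lam * h * w))
    mask = np.zeros(h * w)
    mask[np.argsort(grey, axis=None)[::-1][:n_ones]] = 1.0
    return mask.reshape(h, w)


def fmix(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Mix two (..., T, F) inputs with one FMix mask and mix labels by the mask area."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    mask = fmix_mask(x1.shape[-2:], lam, rng=rng)
    x = mask * x1 + (1.0 - mask) * x2
    # Weight labels by the fraction of bins actually taken from x1.
    lam_eff = mask.mean()
    return x, lam_eff * y1 + (1.0 - lam_eff) * y2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, r = 8, 4  # hypothetical channel count and reduction ratio
    feat = rng.standard_normal((C, 100, 64))  # (channels, time frames, frequency bins)
    w1 = rng.standard_normal((C // r, C))
    w2 = rng.standard_normal((C, C // r))
    print(se_block(feat, w1, w2).shape)  # (8, 100, 64)

    spec_a, spec_b = rng.standard_normal((2, 100, 64))
    mixed, label = fmix(spec_a, np.array([1.0, 0.0]), spec_b, np.array([0.0, 1.0]), rng=rng)
    print(mixed.shape, label)
```

Because the SE gates lie strictly between 0 and 1, each channel is rescaled rather than replaced. In the FMix sketch the label weight uses the realized mask fraction, which matches the sampled `lam` up to rounding, so labels stay consistent with how much of each input survives in the mixed spectrogram.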