Paper Title

Robustness of Neural Architectures for Audio Event Detection

Paper Authors

Juncheng B Li, Zheng Wang, Shuhui Qu, Florian Metze

Paper Abstract

Traditionally, in an audio recognition pipeline, noise is suppressed by the "frontend", which relies on preprocessing techniques such as speech enhancement. However, there is no guarantee that noise will not cascade into downstream pipelines. To understand the actual influence of noise on the entire audio pipeline, in this paper we directly investigate the impact of noise on different types of neural models without the preprocessing step. We measure the recognition performance of four different neural network models on the task of environmental sound classification under three types of noise: \emph{occlusion} (to emulate intermittent noise), \emph{Gaussian} noise (to model continuous noise), and \emph{adversarial perturbations} (the worst-case scenario). Our intuition is that the different ways in which these models process their input (i.e., CNNs have strong locality inductive biases, which Transformers do not) should lead to observable differences in performance and/or robustness, an understanding of which will enable further improvements. We perform extensive experiments on AudioSet, the largest weakly labeled sound event dataset available. We also seek to explain the behaviors of the different models through output distribution changes and weight visualization.
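
As a rough illustration of the three noise types named in the abstract, the sketch below shows how they might be applied to a batch of log-mel spectrograms in PyTorch. This is not the authors' implementation: the tensor shape (batch, 1, mels, frames), the occlusion patch size, the noise level `sigma`, the FGSM step size `eps`, and the multi-label BCE loss are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of the three perturbation types:
# occlusion (intermittent), Gaussian noise (continuous), and a one-step
# adversarial perturbation (worst-case probe).
import torch
import torch.nn.functional as F


def occlude(spec: torch.Tensor, num_patches: int = 3, patch_frames: int = 20) -> torch.Tensor:
    """Zero out random time segments to emulate intermittent (occlusion) noise."""
    out = spec.clone()
    _, _, _, frames = out.shape  # assumed layout: (batch, 1, mels, frames)
    for _ in range(num_patches):
        start = torch.randint(0, max(frames - patch_frames, 1), (1,)).item()
        out[..., start:start + patch_frames] = 0.0
    return out


def add_gaussian(spec: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Add zero-mean Gaussian noise to model continuous background noise."""
    return spec + sigma * torch.randn_like(spec)


def fgsm(model: torch.nn.Module, spec: torch.Tensor, target: torch.Tensor,
         eps: float = 0.01) -> torch.Tensor:
    """One-step FGSM perturbation; the paper's adversarial attack may differ."""
    spec = spec.clone().requires_grad_(True)
    # AudioSet is multi-label, so a per-class BCE-with-logits loss is assumed here.
    loss = F.binary_cross_entropy_with_logits(model(spec), target)
    loss.backward()
    return (spec + eps * spec.grad.sign()).detach()
```

Under these assumptions, each model would be evaluated on clean inputs and on the outputs of `occlude`, `add_gaussian`, and `fgsm`, with the drop in classification performance taken as the robustness measure.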
