Acoustic Scene Classification Using Bilinear Pooling

Paper Title

Acoustic Scene Classification Using Bilinear Pooling on Time-liked and Frequency-liked Convolution Neural Network

Authors

Kek, Xing Yong, Chin, Cheng Siong, Li, Ye

Abstract

The current approach to the Acoustic Scene Classification (ASC) task can be described in two steps: preprocessing the audio waveform into a log-mel spectrogram, then using it as the input representation for a Convolutional Neural Network (CNN). This paradigm shift occurred after DCASE 2016, where this framework achieved state-of-the-art results in ASC tasks: an accuracy of 64.5% on the ESC-50 dataset, a 20.5% improvement over the baseline model, and accuracies of 90.0% (development) and 86.2% (evaluation) on the DCASE 2016 dataset, improvements of 6.4% and 9% over the baseline system. In this paper, we explore the use of harmonic and percussive source separation (HPSS), which has gained popularity in the field of music information retrieval (MIR), to split the audio into harmonic audio and percussive audio. Although HPSS has already been used as an input representation for CNN models in the ASC task, this paper further investigates how to leverage the separated harmonic and percussive components by designing two CNNs that process harmonic audio and percussive audio in their natural form, one specialized in extracting deep features in the time-biased domain and the other in the frequency-biased domain, respectively. The deep features extracted from these two CNNs are then combined using bilinear pooling. The result is a two-stream time and frequency CNN architecture for classifying acoustic scenes. The model is evaluated on the DCASE 2019 subtask 1a dataset and scores an average of 65% across the development dataset and the Kaggle private and public leaderboards.
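
The abstract does not spell out how the two feature streams are fused beyond naming bilinear pooling. As an illustration only, the following is a minimal sketch of generic bilinear pooling: the outer product of the two streams' feature vectors at each location, averaged over all locations. The function name `bilinear_pool`, the list-of-lists layout, and the signed square root plus L2 normalisation step are assumptions made for this example, not details taken from the paper.

```python
import math


def bilinear_pool(time_feats, freq_feats):
    """Fuse two feature maps by an averaged outer product (hypothetical helper).

    time_feats: list of L locations, each a list of M channel values.
    freq_feats: list of L locations, each a list of N channel values.
    Returns a flat list of length M * N.
    """
    if not time_feats or len(time_feats) != len(freq_feats):
        raise ValueError("both streams need the same, non-zero number of locations")

    m, n = len(time_feats[0]), len(freq_feats[0])
    pooled = [0.0] * (m * n)

    # Accumulate the outer product a * b^T over every location.
    for a, b in zip(time_feats, freq_feats):
        for i in range(m):
            for j in range(n):
                pooled[i * n + j] += a[i] * b[j]

    # Average over locations.
    count = len(time_feats)
    pooled = [v / count for v in pooled]

    # Assumed post-processing: signed square root, then L2 normalisation.
    pooled = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled
```

With M channels from the time-biased stream and N from the frequency-biased stream, the fused descriptor has M * N entries, so each entry captures the interaction between one channel of each stream. A classifier layer would then take this vector as input.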
