Paper title
Pretrained Audio Neural Networks for Speech Emotion Recognition in Portuguese
Paper authors
Paper abstract
The goal of speech emotion recognition (SER) is to identify the emotional aspects of speech. The SER challenge for Brazilian Portuguese consists of short speech snippets classified as neutral, non-neutral female, or non-neutral male according to paralinguistic elements (laughing, crying, etc.). The dataset contains about $50$ minutes of Brazilian Portuguese speech. Because the dataset is small, we investigate whether a combination of transfer learning and data augmentation can produce positive results. By combining a data augmentation technique called SpecAugment with transfer learning from Pretrained Audio Neural Networks (PANNs), we are able to obtain interesting results. The PANNs (CNN6, CNN10 and CNN14) are pretrained on AudioSet, a large dataset containing more than $5000$ hours of audio. They were fine-tuned on the SER dataset, and the model that performed best on the validation set (CNN10) was submitted to the challenge, achieving an $F1$ score of $0.73$, up from the $0.54$ of the baselines provided by the challenge. We also tested a Transformer neural architecture pretrained on about $600$ hours of Brazilian Portuguese audio data. Transformers, as well as the more complex PANN model (CNN14), fail to generalize to the test set of the SER dataset and do not beat the baseline. Given the limited dataset size, the best approach for SER is currently to use PANNs (specifically, CNN6 and CNN10).