Multi-Time-Scale Convolution for Emotion Recognition from Speech Audio Signals

Paper Title

Multi-Time-Scale Convolution for Emotion Recognition from Speech Audio Signals

Authors

Eric Guizzo, Tillman Weyde, Jack Barnett Leveson

Abstract

Robustness against temporal variations is important for emotion recognition from speech audio, since emotion is expressed through complex spectral patterns that can exhibit significant local dilation and compression on the time axis depending on speaker and context. To address this and potentially other tasks, we introduce the multi-time-scale (MTS) method to create flexibility towards temporal variations when analyzing time-frequency representations of audio data. MTS extends convolutional neural networks with convolution kernels that are scaled and re-sampled along the time axis, to increase temporal flexibility without increasing the number of trainable parameters compared to standard convolutional layers. We evaluate MTS and standard convolutional layers in different architectures for emotion recognition from speech audio, using 4 datasets of different sizes. The results show that the use of MTS layers consistently improves the generalization of networks of different capacity and depth, compared to standard convolution, especially on smaller datasets.
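
The idea of re-using one kernel at several time scales can be illustrated with a small sketch. This is not the authors' implementation. It assumes linear interpolation for the re-sampling, zero padding along the time axis, and an element-wise maximum to combine the scales. The function names and the `time_lengths` parameter are hypothetical and chosen only for this example. The point it shows is that every scaled kernel is derived from the same base weights, so no new trainable parameters are introduced.

```python
def resample_time(kernel, new_len):
    """Linearly re-sample each row of a (freq x time) kernel to new_len time steps.

    Assumes the kernel has at least 2 time steps and new_len >= 2.
    """
    old_len = len(kernel[0])
    out = []
    for row in kernel:
        new_row = []
        for j in range(new_len):
            # Position of output sample j on the original time axis.
            pos = j * (old_len - 1) / (new_len - 1)
            lo = int(pos)
            hi = min(lo + 1, old_len - 1)
            frac = pos - lo
            new_row.append((1 - frac) * row[lo] + frac * row[hi])
        out.append(new_row)
    return out


def correlate_same_time(spec, kernel):
    """Cross-correlate a (freq x time) input with a kernel.

    Zero padding keeps the time length unchanged; no padding is applied
    along the frequency axis.
    """
    kf, kt = len(kernel), len(kernel[0])
    sf, st = len(spec), len(spec[0])
    half = kt // 2
    out = []
    for f in range(sf - kf + 1):
        row = []
        for t in range(st):
            acc = 0.0
            for i in range(kf):
                for j in range(kt):
                    tt = t + j - half
                    if 0 <= tt < st:
                        acc += spec[f + i][tt] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def mts_response(spec, kernel, time_lengths):
    """Apply the same base kernel at several time scales.

    Combining the scales with an element-wise maximum is an assumption
    made for this sketch, not a detail taken from the abstract.
    """
    responses = [
        correlate_same_time(spec, resample_time(kernel, n))
        for n in time_lengths
    ]
    n_freq, n_time = len(responses[0]), len(responses[0][0])
    return [
        [max(r[f][t] for r in responses) for t in range(n_time)]
        for f in range(n_freq)
    ]


if __name__ == "__main__":
    base_kernel = [[0.0, 1.0, 0.0]]          # one frequency bin, three time steps
    spectrogram = [[0.0, 0.0, 1.0, 0.0, 0.0]]
    print(resample_time(base_kernel, 5))
    print(mts_response(spectrogram, base_kernel, [3, 5]))
```

In a trainable network, the re-sampling would have to be differentiable so that gradients from every scale flow back to the shared base kernel. The pure-Python loops here are only meant to make the mechanism readable.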
