Paper Title
Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition
Paper Authors
Paper Abstract
Speech emotion recognition is a challenging task because emotion expression is complex, multimodal, and fine-grained. In this paper, we propose a novel multimodal deep learning approach to perform fine-grained emotion recognition from real-life speech. We design a temporal alignment mean-max pooling mechanism to capture the subtle and fine-grained emotions implied in every utterance. In addition, we propose a cross modality excitement module that conducts sample-specific adjustment on cross modality embeddings and adaptively recalibrates each modality's values using the aligned latent features from the other modality. Our proposed model is evaluated on two well-known real-world speech emotion recognition datasets. The results demonstrate that our approach is superior on prediction tasks for multimodal speech utterances, and it outperforms a wide range of baselines in terms of prediction accuracy. Furthermore, we conduct detailed ablation studies to show that our temporal alignment mean-max pooling mechanism and cross modality excitement contribute significantly to the promising results. To encourage research reproducibility, we make the code publicly available at \url{https://github.com/tal-ai/FG_CME.git}.
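To make the two mechanisms named in the abstract concrete, below is a minimal PyTorch sketch of (1) a cross-modality excitement step, in which one modality's time-aligned features produce per-dimension gates that recalibrate the other modality's embeddings, and (2) temporal mean-max pooling over the aligned sequence. All module and variable names (`CrossModalityExcitement`, `mean_max_pool`, the hidden sizes) are illustrative assumptions, not the authors' released implementation; see the linked repository for the actual code.

```python
# Hedged sketch: cross-modality excitement gating + temporal mean-max pooling.
# Shapes and layer choices are assumptions for illustration only.
import torch
import torch.nn as nn


class CrossModalityExcitement(nn.Module):
    """Recalibrate `target` features with gates computed from aligned `source` features."""

    def __init__(self, dim: int):
        super().__init__()
        # One linear layer followed by a sigmoid produces per-dimension gates in (0, 1).
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target, source: (batch, time, dim), assumed time-aligned frame/token features
        return target * self.gate(source)


def mean_max_pool(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, time, dim) -> (batch, 2 * dim), concatenating mean and max over time
    return torch.cat([x.mean(dim=1), x.max(dim=1).values], dim=-1)


if __name__ == "__main__":
    batch, time, dim = 4, 20, 128
    text_feats = torch.randn(batch, time, dim)   # e.g. word embeddings aligned to frames
    audio_feats = torch.randn(batch, time, dim)  # e.g. acoustic frame features

    excite_text = CrossModalityExcitement(dim)   # audio features gate the text stream
    excite_audio = CrossModalityExcitement(dim)  # text features gate the audio stream

    text_recal = excite_text(text_feats, audio_feats)
    audio_recal = excite_audio(audio_feats, text_feats)

    # Pool each recalibrated stream over time and concatenate into an utterance vector.
    utterance_repr = torch.cat(
        [mean_max_pool(text_recal), mean_max_pool(audio_recal)], dim=-1
    )
    print(utterance_repr.shape)  # torch.Size([4, 512])
```

In this sketch the utterance vector would feed a standard classification head for the emotion labels; the gating direction (which modality excites which) and the pooling granularity follow the abstract's description, while everything else is filled in by assumption.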