Paper Title
Emotion recognition by fusing time synchronous and time asynchronous representations
Paper Authors
Paper Abstract
In this paper, a novel two-branch neural network model structure is proposed for multimodal emotion recognition, which consists of a time synchronous branch (TSB) and a time asynchronous branch (TAB). To capture correlations between each word and its acoustic realisation, the TSB combines speech and text modalities at each input window frame and then applies pooling across time to form a single embedding vector. The TAB, by contrast, provides cross-utterance information by integrating sentence text embeddings from a number of context utterances into another embedding vector. The final emotion classification uses both the TSB and the TAB embeddings. Experimental results on the IEMOCAP dataset demonstrate that the two-branch structure achieves state-of-the-art results in 4-way classification with all common test setups. When using automatic speech recognition (ASR) output instead of manually transcribed reference text, it is shown that the cross-utterance information considerably improves the robustness against ASR errors. Furthermore, by incorporating an extra class for all the other emotions, the final 5-way classification system with ASR hypotheses can be viewed as a prototype for more realistic emotion recognition systems.
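Since the abstract describes the two-branch architecture only at a high level, the following is a minimal PyTorch sketch of how a TSB and a TAB could be combined for classification. The layer types, feature dimensions, mean pooling, and the context size K are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class TwoBranchEmotionModel(nn.Module):
    """Illustrative two-branch model; layer choices and sizes are assumptions."""
    def __init__(self, acoustic_dim=80, word_dim=300, sent_dim=768,
                 hidden_dim=256, num_classes=4):
        super().__init__()
        # Time synchronous branch (TSB): frame-level acoustic features and
        # frame-aligned word embeddings are concatenated, encoded, then
        # pooled across time into one embedding vector.
        self.tsb_encoder = nn.GRU(acoustic_dim + word_dim, hidden_dim,
                                  batch_first=True, bidirectional=True)
        # Time asynchronous branch (TAB): sentence embeddings of the current
        # and neighbouring utterances are fused into a second vector.
        self.tab_encoder = nn.GRU(sent_dim, hidden_dim,
                                  batch_first=True, bidirectional=True)
        # Final classifier takes the concatenation of both branch embeddings.
        self.classifier = nn.Linear(4 * hidden_dim, num_classes)

    def forward(self, frame_feats, sent_embs):
        # frame_feats: (batch, T, acoustic_dim + word_dim)
        # sent_embs:   (batch, K, sent_dim), K = number of context utterances
        tsb_out, _ = self.tsb_encoder(frame_feats)         # (batch, T, 2*hidden)
        tsb_vec = tsb_out.mean(dim=1)                      # mean pooling across time
        _, tab_h = self.tab_encoder(sent_embs)             # (2, batch, hidden)
        tab_vec = torch.cat([tab_h[0], tab_h[1]], dim=-1)  # (batch, 2*hidden)
        return self.classifier(torch.cat([tsb_vec, tab_vec], dim=-1))

In this sketch, the word embeddings are assumed to be repeated over the frames each word spans, so the TSB sees both modalities at every frame, which is what allows it to model the correlation between a word and its acoustic realisation. The TAB operates on one embedding per utterance and is therefore independent of frame-level timing.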