Paper Title


Hierarchical Transformer Network for Utterance-level Emotion Recognition

Paper Authors

Li, QingBiao, Wu, ChunHua, Zheng, KangFeng, Wang, Zhe

Abstract


While there have been significant advances in detecting emotions in text, in the field of utterance-level emotion recognition (ULER), there are still many problems to be solved. In this paper, we address some challenges in ULER in dialog systems. (1) The same utterance can deliver different emotions when it is in different contexts or from different speakers. (2) Long-range contextual information is hard to capture effectively. (3) Unlike the traditional text classification problem, this task is supported by a limited number of datasets, among which most contain inadequate conversations or speech. To address these problems, we propose a hierarchical transformer framework (apart from the description of other studies, the "transformer" in this paper usually refers to the encoder part of the transformer) with a lower-level transformer to model the word-level input and an upper-level transformer to capture the context of utterance-level embeddings. We use a pretrained language model, bidirectional encoder representations from transformers (BERT), as the lower-level transformer, which is equivalent to introducing external data into the model and solves the problem of data shortage to some extent. In addition, we add speaker embeddings to the model for the first time, which enables our model to capture the interaction between speakers. Experiments on three dialog emotion datasets, Friends, EmotionPush, and EmoryNLP, demonstrate that our proposed hierarchical transformer network models achieve 1.98%, 2.83%, and 3.94% improvement, respectively, over the state-of-the-art methods on each dataset in terms of macro-F1.
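The data flow the abstract describes (a lower-level encoder pooling each utterance's words into one vector, speaker embeddings added to each utterance vector, and an upper-level encoder contextualizing the dialog) can be sketched as below. This is a minimal, illustrative sketch, not the authors' implementation: the `LowerEncoder`, `UpperEncoder`, and `speaker_embedding` classes/functions are hypothetical toy stand-ins (a real system would use pretrained BERT for the lower level and a transformer encoder stack for the upper level).

```python
from typing import List

DIM = 4  # toy embedding size (real models use hundreds of dimensions)

class LowerEncoder:
    """Toy stand-in for BERT: maps one utterance's word tokens to a fixed vector."""
    def encode(self, tokens: List[str]) -> List[float]:
        # Illustrative pooling: average of per-token hash-derived features.
        vecs = [[(hash(t) >> i) % 7 / 7.0 for i in range(DIM)] for t in tokens]
        return [sum(col) / len(vecs) for col in zip(*vecs)]

class UpperEncoder:
    """Toy stand-in for the utterance-level transformer: mixes each utterance
    embedding with dialog-level context (here, a simple mean over the dialog)."""
    def encode(self, utt_embs: List[List[float]]) -> List[List[float]]:
        ctx = [sum(col) / len(utt_embs) for col in zip(*utt_embs)]
        return [[0.5 * (u + c) for u, c in zip(emb, ctx)] for emb in utt_embs]

def speaker_embedding(speaker: str) -> List[float]:
    """Hypothetical speaker embedding lookup (a learned table in practice)."""
    return [(hash(speaker) >> i) % 5 / 5.0 for i in range(DIM)]

def encode_dialog(dialog):
    """dialog: list of (speaker, tokens) pairs.
    Returns one context-aware vector per utterance, which would then be fed
    to a per-utterance emotion classifier."""
    lower, upper = LowerEncoder(), UpperEncoder()
    utt_embs = []
    for speaker, tokens in dialog:
        emb = lower.encode(tokens)                      # word level -> utterance vector
        spk = speaker_embedding(speaker)                # inject speaker identity
        utt_embs.append([e + s for e, s in zip(emb, spk)])
    return upper.encode(utt_embs)                       # utterance level -> dialog context

dialog = [("Joey", ["how", "you", "doin"]),
          ("Ross", ["we", "were", "on", "a", "break"])]
outputs = encode_dialog(dialog)
print(len(outputs), len(outputs[0]))  # one DIM-sized vector per utterance
```

The key design point the abstract emphasizes is the two-level hierarchy: the same utterance can receive a different final representation (and thus a different predicted emotion) depending on the surrounding dialog and the speaker vector added to it.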
