Continuous Metric Learning For Transferable Speech Emotion Recognition and Embedding Across Low-resource Languages

Paper title

Continuous Metric Learning For Transferable Speech Emotion Recognition and Embedding Across Low-resource Languages

Authors

Sneha Das, Nicklas Leander Lund, Nicole Nadine Lønfeldt, Anne Katrine Pagsberg, Line H. Clemmensen

Abstract

Speech emotion recognition (SER) refers to the technique of inferring the emotional state of an individual from speech signals. SER systems continue to garner interest due to their wide applicability. Although the domain is mainly founded on signal processing, machine learning, and deep learning, generalizing over languages remains a challenge. Developing generalizable and transferable models is nonetheless critical, because data and labels are scarce for languages beyond the most commonly spoken ones. To improve performance over languages, we propose a denoising autoencoder with semi-supervision using a continuous metric loss based on either activation or valence. The novelty of this work lies in our proposal of continuous metric learning, which is, to the best of our knowledge, among the first proposals on the topic. Furthermore, to address the lack of activation and valence labels in the transfer datasets, we annotate the signal samples with activation and valence levels corresponding to a dimensional model of emotions; these annotations are then used to evaluate the quality of the embedding over the transfer datasets. We show that the proposed semi-supervised model consistently outperforms the baseline unsupervised method, a conventional denoising autoencoder, in terms of both emotion classification accuracy and correlation with the dimensional variables. Further evaluation of classification accuracy against the reference, a BERT-based speech representation model, shows that the proposed method is comparable to the reference method in classifying specific emotion classes at a much lower complexity.
