Paper Title

Prediction of Depression Severity Based on the Prosodic and Semantic Features with Bidirectional LSTM and Time Distributed CNN

Paper Authors

Mao, Kaining, Zhang, Wei, Wang, Deborah Baofeng, Li, Ang, Jiao, Rongqi, Zhu, Yanhui, Wu, Bin, Zheng, Tiansheng, Qian, Lei, Lyu, Wei, Ye, Minjie, Chen, Jie

Paper Abstract

Depression is increasingly impacting individuals both physically and psychologically worldwide. It has become a major global public health problem and attracts attention from various research fields. Traditionally, the diagnosis of depression is formulated through semi-structured interviews and supplementary questionnaires, which makes the diagnosis rely heavily on physicians' experience and leaves it subject to bias. Mental health monitoring and cloud-based remote diagnosis can be implemented through an automated depression diagnosis system. In this article, we propose an attention-based multimodal speech and text representation for depression prediction. Our model is trained to estimate the depression severity of participants using the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ) dataset. For the audio modality, we use the COllaborative Voice Analysis REPository (COVAREP) features provided by the dataset and employ a Bidirectional Long Short-Term Memory network (Bi-LSTM) followed by a Time-distributed Convolutional Neural Network (T-CNN). For the text modality, we use Global Vectors for word representation (GloVe) to perform word embeddings, and the embeddings are fed into the Bi-LSTM network. Results show that both the audio and text models perform well on the depression severity estimation task, with a best sequence-level F1 score of 0.9870 and patient-level F1 score of 0.9074 for the audio model over five classes (healthy, mild, moderate, moderately severe, and severe), as well as a sequence-level F1 score of 0.9709 and patient-level F1 score of 0.9245 for the text model over five classes. Results are similar for the multimodal fused model, with the highest F1 score of 0.9580 on the patient-level depression detection task over five classes. Experiments show statistically significant improvements over previous works.
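The abstract's two-branch pipeline (COVAREP frames → Bi-LSTM → time-distributed CNN for audio; GloVe embeddings → Bi-LSTM for text; fused for classification) can be sketched as below. This is a hypothetical PyTorch reconstruction, not the authors' code: the hidden sizes, the mean pooling over time, and the late-fusion-by-averaging step are all assumptions, and only the feature dimensions (74 COVAREP features per frame, 300-d GloVe vectors) and the five severity classes come from common usage and the abstract.

```python
# Hypothetical sketch of the described architecture; layer sizes, pooling,
# and the fusion scheme are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """Bi-LSTM over COVAREP frames, followed by a time-distributed CNN.

    A Conv1d sliding along the time axis applies the same filters at every
    time step, which is one way to realize a time-distributed CNN.
    """
    def __init__(self, n_feats=74, hidden=128, n_classes=5):
        super().__init__()
        self.bilstm = nn.LSTM(n_feats, hidden, batch_first=True,
                              bidirectional=True)
        self.cnn = nn.Conv1d(2 * hidden, 64, kernel_size=3, padding=1)
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                  # x: (batch, time, n_feats)
        h, _ = self.bilstm(x)              # (batch, time, 2*hidden)
        h = self.cnn(h.transpose(1, 2))    # (batch, 64, time)
        h = h.mean(dim=2)                  # pool over time (assumption)
        return self.head(h)                # (batch, n_classes)

class TextBranch(nn.Module):
    """Precomputed GloVe embeddings fed into a Bi-LSTM classifier."""
    def __init__(self, emb_dim=300, hidden=128, n_classes=5):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, e):                  # e: (batch, words, emb_dim)
        h, _ = self.bilstm(e)
        return self.head(h[:, -1])         # last time step (assumption)

# Dummy inputs: 2 sequences of 50 COVAREP frames / 20 GloVe word vectors.
audio_logits = AudioBranch()(torch.randn(2, 50, 74))
text_logits = TextBranch()(torch.randn(2, 20, 300))
# Late fusion by averaging class scores (assumed; the paper reports a
# fused model but the abstract does not specify the fusion mechanism).
fused = (audio_logits + text_logits) / 2
print(fused.shape)  # torch.Size([2, 5])
```

In practice each branch would be trained on sequence-level labels and the per-sequence predictions aggregated to a patient-level decision, matching the sequence-level and patient-level F1 scores reported above.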
