Paper Title

Exploiting Fully Convolutional Network and Visualization Techniques on Spontaneous Speech for Dementia Detection

Paper Authors

Youxiang Zhu, Xiaohui Liang

Paper Abstract

In this paper, we exploit a Fully Convolutional Network (FCN) to analyze the audio data of spontaneous speech for dementia detection. A fully convolutional network accommodates speech samples of varying lengths, enabling us to analyze speech samples without manual segmentation. Specifically, we first obtain the Mel-Frequency Cepstral Coefficient (MFCC) feature map from each participant's audio data, converting the speech classification task on audio data into an image classification task on MFCC feature maps. Then, to address the data insufficiency problem, we apply transfer learning by adopting a backbone Convolutional Neural Network (CNN) from the MobileNet architecture, pre-trained on the ImageNet dataset. We further build a convolutional layer that produces a heatmap, visualized with Otsu's method, enabling us to understand the impact of time-series audio segments on the classification results. We demonstrate that our classification model achieves 66.7% accuracy on the testing dataset, compared with the 62.5% accuracy of the baseline model provided in the ADReSS challenge. Through the visualization technique, we can evaluate the impact of audio segments, such as filled pauses from the participants and repeated questions from the investigator, on the classification results.
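
To make the MFCC step concrete, here is a minimal sketch of turning one recording into a feature map, assuming the librosa library; the 16 kHz sample rate, the coefficient count, and the filename are illustrative assumptions, not the paper's exact settings.

```python
import librosa
import numpy as np

def audio_to_mfcc_map(path: str, n_mfcc: int = 40) -> np.ndarray:
    """Load a speech recording and return its MFCC feature map.

    The result has shape (n_mfcc, n_frames); n_frames varies with the
    recording length, which is why a fully convolutional model is needed.
    """
    signal, sr = librosa.load(path, sr=16000)  # resample to 16 kHz (assumed)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Normalize each coefficient so the map behaves like an image input.
    return (mfcc - mfcc.mean(axis=1, keepdims=True)) / (
        mfcc.std(axis=1, keepdims=True) + 1e-8
    )

# Hypothetical filename for illustration.
feature_map = audio_to_mfcc_map("participant_001.wav")
print(feature_map.shape)  # (40, n_frames), n_frames depends on duration
```

Because the frame axis is never fixed, no padding or manual segmentation of the recording is required before classification.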
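
The fully convolutional classifier and the heatmap can then be sketched as follows, under stated assumptions: a MobileNetV2 trunk pre-trained on ImageNet, a 1x1 convolutional head, and global average pooling over the resulting score map, with Otsu's threshold binarizing the time-series heatmap. The layer sizes, the channel replication of the MFCC map, and the frequency-averaging step are illustrative choices, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from skimage.filters import threshold_otsu

class SpeechFCN(nn.Module):
    """MobileNetV2 trunk + 1x1 conv head; accepts inputs of any length."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        backbone = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1)
        self.features = backbone.features            # convolutional trunk only
        self.head = nn.Conv2d(1280, num_classes, 1)  # 1x1 conv -> class score map

    def forward(self, x: torch.Tensor):
        # x: (batch, 3, n_mfcc, n_frames); n_frames may differ per sample.
        score_map = self.head(self.features(x))      # (batch, classes, h, w)
        logits = score_map.mean(dim=(2, 3))          # global average pooling
        return logits, score_map

model = SpeechFCN().eval()
# Stand-in MFCC "image": a real input would replicate the MFCC map across
# three channels to match the ImageNet-pretrained stem.
mfcc_img = torch.randn(1, 3, 40, 900)
with torch.no_grad():
    logits, score_map = model(mfcc_img)

# Collapse the dementia-class score map over frequency into a time-series
# heatmap, then binarize with Otsu's threshold to mark salient segments.
heat = score_map[0, 1].mean(dim=0).numpy()
salient_frames = heat > threshold_otsu(heat)
```

Because the classification decision is obtained by pooling a spatial score map, the same map doubles as a per-segment saliency signal, which is what permits segment-level observations such as the effect of participants' filled pauses and the investigator's repeated questions.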
