CNN Self-Attention Voice Activity Detector

Paper title

CNN self-attention voice activity detector

Authors

Amit Sofer, Shlomo E. Chazan

Abstract

In this work we present a novel single-channel Voice Activity Detector (VAD) approach. We use a Convolutional Neural Network (CNN) that exploits the spatial information of the noisy input spectrum to extract a frame-wise embedding sequence, followed by a Self-Attention (SA) encoder whose goal is to find contextual information in the embedding sequence. Unlike previous works, which were applied to each frame (with context frames) separately, our method can process the entire signal at once, which enables a long receptive field. We show that the fusion of CNN and SA architectures outperforms methods based solely on CNN or SA. An extensive experimental study shows that our model outperforms previous models on real-life benchmarks and provides State Of The Art (SOTA) results with a relatively small and lightweight model.
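
The pipeline described in the abstract (a convolutional front end that produces one embedding per frame, then self-attention across all frames, then a per-frame speech decision) can be illustrated with a minimal NumPy sketch. This is not the authors' architecture. The 1-D convolution along frequency, the single attention head, the linear output layer, all dimensions, and the random untrained weights are assumptions chosen only to show how the data flows.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_embeddings(spec, kernels):
    """Convolve each frame along frequency, apply ReLU, max-pool over frequency.

    spec:    (T, F) magnitude spectrogram, T frames by F frequency bins
    kernels: (D, K) hypothetical bank of D 1-D filters of width K
    returns: (T, D) one embedding per frame
    """
    T, F = spec.shape
    D, K = kernels.shape
    # Sliding windows over the frequency axis: shape (T, F-K+1, K)
    windows = np.stack([spec[:, i:i + K] for i in range(F - K + 1)], axis=1)
    responses = np.maximum(windows @ kernels.T, 0.0)  # (T, F-K+1, D)
    return responses.max(axis=1)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over the whole sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])              # (T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def vad_probabilities(spec, params):
    """Per-frame speech probability for the entire signal in one pass."""
    emb = frame_embeddings(spec, params["kernels"])
    ctx, _ = self_attention(emb, params["w_q"], params["w_k"], params["w_v"])
    logits = ctx @ params["w_out"] + params["b_out"]     # (T,)
    return 1.0 / (1.0 + np.exp(-logits))                 # sigmoid

# Assumed toy sizes: 50 frames, 64 frequency bins, 16-dim embeddings, width-5 filters
T, F, D, K = 50, 64, 16, 5
params = {
    "kernels": 0.1 * rng.normal(size=(D, K)),
    "w_q": 0.1 * rng.normal(size=(D, D)),
    "w_k": 0.1 * rng.normal(size=(D, D)),
    "w_v": 0.1 * rng.normal(size=(D, D)),
    "w_out": 0.1 * rng.normal(size=D),
    "b_out": 0.0,
}
spec = np.abs(rng.normal(size=(T, F)))  # stand-in for a noisy spectrogram
probs = vad_probabilities(spec, params)
```

Because the attention weights form a T-by-T matrix, every frame's output can depend on every other frame in the signal, which is the sense in which processing the whole signal at once gives a long receptive field. With untrained weights the probabilities are meaningless; only the shapes and data flow are illustrative.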
