End-to-End Speaker-Dependent Voice Activity Detection

Paper title

End-to-End Speaker-Dependent Voice Activity Detection

Authors

Yefei Chen, Shuai Wang, Yanmin Qian, Kai Yu

Abstract

Voice activity detection (VAD) is an essential pre-processing step for tasks such as automatic speech recognition (ASR) and speaker recognition. Its basic goal is to remove silent segments from an audio recording, while a more general VAD system could remove all irrelevant segments, such as noise and even unwanted speech from non-target speakers. We define the task of detecting only the speech of the target speaker as speaker-dependent voice activity detection (SDVAD). This task is quite common in real applications and is usually implemented by performing speaker verification (SV) on audio segments extracted by VAD. In this paper, we propose an end-to-end neural-network-based approach to this problem, which explicitly takes the speaker identity into the modeling process. Moreover, inference can be performed in an online fashion, which leads to low system latency. Experiments are carried out on a conversational telephone dataset generated from the Switchboard corpus. Results show that our proposed online approach achieves significantly better performance than the usual VAD/SV system in terms of both frame accuracy and F-score. We also use our previously proposed segment-level metric for a more comprehensive analysis.
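
The abstract reports results as frame accuracy and F-score. As a minimal sketch of how such frame-level scores are commonly computed, the Python below assumes each frame carries a binary label (1 = speech from the target speaker, 0 = anything else). The function name `frame_metrics` and the toy label sequences are hypothetical and are not taken from the paper, whose exact scoring setup is not given here.

```python
def frame_metrics(reference, predicted):
    """Return (accuracy, f_score) for binary per-frame labels.

    Assumption: 1 marks a frame of target-speaker speech, 0 marks
    everything else (silence, noise, non-target speech).
    """
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("label sequences must not be empty")

    pairs = list(zip(reference, predicted))
    true_pos = sum(1 for r, p in pairs if r == 1 and p == 1)
    false_pos = sum(1 for r, p in pairs if r == 0 and p == 1)
    false_neg = sum(1 for r, p in pairs if r == 1 and p == 0)
    correct = sum(1 for r, p in pairs if r == p)

    accuracy = correct / len(pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # F-score here is the harmonic mean of precision and recall (F1).
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return accuracy, f_score


# Toy example with eight frames (made-up labels, for illustration only).
reference = [0, 1, 1, 1, 0, 0, 1, 0]
predicted = [0, 1, 1, 0, 0, 1, 1, 0]
accuracy, f_score = frame_metrics(reference, predicted)
print(f"frame accuracy = {accuracy:.2f}, F-score = {f_score:.2f}")
```

Frame accuracy alone can look good when target speech is rare, since predicting 0 everywhere already scores well; that is one reason to report an F-score alongside it.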

Paper title