Paper Title

Speech Activity Detection Based on Multilingual Speech Recognition System

Authors

Sarfjoo, Seyyed Saeed, Madikeri, Srikanth, Motlicek, Petr

Abstract

To better model contextual information and improve the generalization ability of a Speech Activity Detection (SAD) system, this paper leverages a multilingual Automatic Speech Recognition (ASR) system to perform SAD. Sequence-discriminative training of the Acoustic Model (AM) with the Lattice-Free Maximum Mutual Information (LF-MMI) loss function effectively extracts the contextual information of the input acoustic frames, and multilingual AM training yields robustness to noise and language variability. The index of the maximum output posterior is used as a frame-level speech/non-speech decision function, and majority voting and logistic regression are applied to fuse the language-dependent decisions. The multilingual ASR is trained on 18 languages of the BABEL datasets, and the resulting SAD is evaluated on 3 different languages. On out-of-domain datasets, the proposed SAD model performs significantly better than the baseline models. On the Ester2 dataset, without using any in-domain data, it outperforms the WebRTC, phoneme-recognizer-based VAD (Phn Rec), and Pyannote baselines in the Detection Error Rate (DetER) metric by 7.1, 1.7, and 2.7% absolute, respectively. Similarly, on the LiveATC dataset, it outperforms the WebRTC, Phn Rec, and Pyannote baselines by 6.4, 10.0, and 3.7% absolute in DetER.
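The two decision steps in the abstract — taking the argmax of the AM's output posteriors as a frame-level speech/non-speech decision, then fusing the language-dependent decisions by majority voting — can be sketched as below. This is a minimal illustration, not the paper's implementation: the array shapes, the `speech_indices` set marking which output units count as speech, and the tie-breaking rule are all assumptions for demonstration.

```python
import numpy as np

def frame_decisions(posteriors, speech_indices):
    """Frame-level speech/non-speech decision from one language-dependent AM.

    posteriors: (T, C) array of output posteriors over C output units for T frames.
    speech_indices: output-unit indices treated as speech (assumed labeling).
    Returns a (T,) binary array: 1 = speech, 0 = non-speech.
    """
    hard = np.argmax(posteriors, axis=1)          # index of maximum posterior per frame
    return np.isin(hard, speech_indices).astype(int)

def majority_vote(decisions):
    """Fuse binary decisions from L language-dependent systems by majority vote.

    decisions: (L, T) binary array. Ties are counted as speech (assumption).
    Returns a (T,) fused binary decision.
    """
    votes = decisions.sum(axis=0)
    return (2 * votes >= decisions.shape[0]).astype(int)
```

The logistic-regression fusion mentioned in the abstract would instead learn per-language weights over the same frame-level decisions (or posteriors), which requires some labeled adaptation data, whereas majority voting is parameter-free.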
