Speaker Verification in Multi-Speaker Environments Using Temporal Feature Fusion

Paper title

Speaker Verification in Multi-Speaker Environments Using Temporal Feature Fusion

Authors

Ahmad Aloradi, Wolfgang Mack, Mohamed Elminshawi, Emanuël A. P. Habets

Abstract

Verifying the identity of a speaker is crucial in modern human-machine interfaces, e.g., to ensure privacy protection or to enable biometric authentication. Classical speaker verification (SV) approaches estimate a fixed-dimensional embedding from a speech utterance that encodes the speaker's voice characteristics. A speaker is verified if his/her voice embedding is sufficiently similar to the embedding of the claimed speaker. However, such approaches assume that only a single speaker exists in the input. The presence of concurrent speakers is likely to have detrimental effects on the performance. To address SV in a multi-speaker environment, we propose an end-to-end deep learning-based SV system that detects whether the target speaker exists within an input or not. First, an embedding is estimated from a reference utterance to represent the target's characteristics. Second, frame-level features are estimated from the input mixture. The reference embedding is then fused frame-wise with the mixture's features to allow distinguishing the target from other speakers on a frame basis. Finally, the fused features are used to predict whether the target speaker is active in the speech segment or not. Experimental evaluation shows that the proposed method outperforms the x-vector in multi-speaker conditions.
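
The four steps in the abstract (reference embedding, frame-level mixture features, frame-wise fusion, segment-level decision) can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the abstract does not say how the fusion is done, so concatenation is assumed here, and the function names `fuse_framewise` and `target_presence_probability`, the dimensions, and the random weights standing in for trained layers are all hypothetical.

```python
import numpy as np

def fuse_framewise(frame_features, reference_embedding):
    """Attach the reference embedding to every frame (assumed fusion: concatenation)."""
    num_frames = frame_features.shape[0]
    tiled = np.tile(reference_embedding, (num_frames, 1))    # shape (T, E)
    return np.concatenate([frame_features, tiled], axis=1)   # shape (T, F + E)

def target_presence_probability(fused, weights, bias):
    """Score each fused frame, average over time, and map the result to (0, 1)."""
    frame_logits = fused @ weights + bias                    # shape (T,)
    segment_logit = frame_logits.mean()                      # assumed pooling: mean
    return 1.0 / (1.0 + np.exp(-segment_logit))

# Random placeholders for the outputs of trained networks (assumption).
rng = np.random.default_rng(0)
T, F, E = 50, 8, 4                                # frames, feature dim, embedding dim
frame_features = rng.standard_normal((T, F))      # frame-level features of the mixture
reference_embedding = rng.standard_normal(E)      # embedding of the reference utterance
weights = 0.1 * rng.standard_normal(F + E)        # stand-in for a trained classifier

fused = fuse_framewise(frame_features, reference_embedding)
prob = target_presence_probability(fused, weights, bias=0.0)
print(fused.shape, float(prob))
```

In a trained system the probability would be compared against a threshold to accept or reject the claim that the target speaker is active in the segment. With random weights the value printed here carries no meaning beyond showing the data flow.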
