Paper title
Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches
Paper authors
Paper abstract
Recently, end-to-end speaker extraction has attracted increasing attention and shown promising results. However, its performance is often inferior to that of a blind source separation (BSS) counterpart with a similar network architecture, because the auxiliary speaker encoder may sometimes generate ambiguous speaker embeddings. Such ambiguous guidance may confuse the separation network and lead to wrong extraction results, which degrades overall performance. We refer to this as the target confusion problem. In this paper, we analyze this problem and address it in two stages. In the training phase, we propose integrating metric learning methods to improve the distinguishability of the embeddings produced by the speaker encoder. For inference, we design a novel post-filtering strategy to revise the wrong results. Specifically, we first identify confusion samples by measuring the similarity between output estimates and enrollment utterances, and then recover the true target sources by a subtraction operation. Experiments show a performance improvement of more than 1 dB SI-SDRi, which validates the effectiveness of our methods and underscores the impact of the target confusion problem.
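The post-filtering idea in the abstract (flag an output whose similarity to the enrollment utterance is low, then recover the target by subtraction) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a two-speaker mixture, so that subtracting a wrongly extracted speaker from the mixture leaves an approximation of the target. The names `post_filter`, `embed`, and `threshold`, the cosine similarity measure, the threshold value of 0.5, and the toy spectral "embedding" used in the demo are all hypothetical choices made for illustration; a real system would use a trained speaker encoder.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0


def post_filter(mixture, estimate, enrollment, embed, threshold=0.5):
    """Hypothetical post-filter: returns (signal, was_confused).

    `embed` maps a waveform to a speaker embedding (assumed to be given).
    `threshold` is an illustrative value, not taken from the paper.
    """
    similarity = cosine_similarity(embed(estimate), embed(enrollment))
    if similarity >= threshold:
        # The estimate resembles the enrolled speaker: keep it unchanged.
        return estimate, False
    # Suspected target confusion: under the two-speaker assumption,
    # the residual (mixture minus estimate) approximates the true target.
    return mixture - estimate, True


# Toy demo: two "speakers" are sinusoids at different frequencies, and the
# stand-in embedding is the magnitude spectrum (an assumption for this demo).
t = np.arange(1000) / 1000.0
target = np.sin(2 * np.pi * 50 * t)
interferer = np.sin(2 * np.pi * 120 * t)
mixture = target + interferer
enrollment = np.sin(2 * np.pi * 50 * t + 0.3)  # same "speaker", other utterance


def toy_embed(x):
    return np.abs(np.fft.rfft(x))


# Case 1: the extractor wrongly returned the interferer.
fixed, confused = post_filter(mixture, interferer, enrollment, toy_embed)
# Case 2: the extractor returned the correct speaker.
kept, confused_ok = post_filter(mixture, target, enrollment, toy_embed)
```

In this toy setting the first call flags the estimate and returns the residual, while the second call passes the estimate through unchanged. With more than two speakers or with noise, the residual would no longer isolate the target, so the subtraction step as sketched here would need a different recovery rule.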