Paper Title

Back-ends Selection for Deep Speaker Embeddings

Paper Authors

Zhuo Li, Runqiu Xiao, Zihan Zhang, Zhenduo Zhao, Wenchao Wang, Pengyuan Zhang

Paper Abstract

Probabilistic Linear Discriminant Analysis (PLDA) was the dominant and necessary back-end for early speaker recognition approaches, like i-vector and x-vector. However, with the development of neural networks and margin-based loss functions, we can obtain deep speaker embeddings (DSEs), which have the advantages of larger inter-class separation and smaller intra-class distances. In this case, PLDA seems unnecessary or even counterproductive for such discriminative embeddings, and cosine similarity scoring (Cos) achieves better performance than PLDA in some situations. Motivated by this, in this paper we systematically explore how to select a back-end (Cos or PLDA) for deep speaker embeddings to achieve better performance in different situations. By analyzing PLDA and the properties of DSEs extracted from models with different numbers of segment-level layers, we make the conjecture that Cos is better in same-domain situations and PLDA is better in cross-domain situations. We conduct experiments on the VoxCeleb and NIST SRE datasets in four application situations, single-/multi-domain training and same-/cross-domain testing, to validate our conjecture, and briefly explain why back-end adaptation algorithms work.
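
The two back-ends compared in the abstract can be sketched in a few lines of NumPy/SciPy. The following is a minimal illustration, not the authors' implementation: cosine similarity scoring of two embeddings, plus a log-likelihood-ratio scorer under a simplified two-covariance PLDA model. The names `mu`, `Sb`, and `Sw` are assumptions standing in for the global mean and the between-/within-speaker covariances that would be estimated from training embeddings.

```python
# Minimal sketch of the two back-ends (not the paper's code):
# cosine similarity scoring vs. a simplified two-covariance PLDA scorer.
import numpy as np
from scipy.stats import multivariate_normal


def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    enroll = enroll / np.linalg.norm(enroll)
    test = test / np.linalg.norm(test)
    return float(enroll @ test)


def plda_llr(x1: np.ndarray, x2: np.ndarray,
             mu: np.ndarray, Sb: np.ndarray, Sw: np.ndarray) -> float:
    """Verification log-likelihood ratio under a two-covariance PLDA
    model: target hypothesis = the two embeddings share one latent
    speaker variable; non-target = independent speakers.
    `mu`, `Sb`, `Sw` (mean, between-/within-speaker covariances) are
    assumed to come from a separate training stage."""
    d = mu.shape[0]
    x = np.concatenate([x1, x2])
    m = np.concatenate([mu, mu])
    tot = Sb + Sw                      # total covariance of one embedding
    zero = np.zeros((d, d))
    # Same speaker: the between-speaker component is shared, so the
    # off-diagonal blocks of the joint covariance equal Sb.
    cov_tar = np.block([[tot, Sb], [Sb, tot]])
    # Different speakers: the two embeddings are independent.
    cov_non = np.block([[tot, zero], [zero, tot]])
    return (multivariate_normal.logpdf(x, m, cov_tar)
            - multivariate_normal.logpdf(x, m, cov_non))


# Toy usage with random stand-in embeddings and identity-like covariances.
rng = np.random.default_rng(0)
d = 8
mu, Sb, Sw = np.zeros(d), np.eye(d), 0.5 * np.eye(d)
e1, e2 = rng.normal(size=d), rng.normal(size=d)
print(cosine_score(e1, e2), plda_llr(e1, e2, mu, Sb, Sw))
```

In this simplified view, Cos is parameter-free once embeddings are length-normalized, while PLDA carries trainable covariances; that extra modeling capacity is what the paper's conjecture ties to cross-domain situations, where the score can absorb domain shift that cosine scoring cannot.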
