Length- and Noise-aware Training Techniques for Short-utterance Speaker Recognition

Paper title

Length- and Noise-aware Training Techniques for Short-utterance Speaker Recognition

Authors

Wenda Chen, Jonathan Huang, Tobias Bocklet

Abstract

Speaker recognition performance has improved greatly with the emergence of deep learning. Deep neural networks can deal effectively with the impact of noise and reverberation, which makes them attractive for far-field speaker recognition systems. The x-vector framework is a popular choice for generating speaker embeddings in recent literature due to its robust training mechanism and excellent performance on various test sets. In this paper, we start from earlier work that includes invariant representation learning (IRL) in the loss function and modify the approach with centroid alignment (CA) and length variability cost (LVC) techniques to further improve robustness in noisy, far-field applications. This work mainly focuses on improvements for short-duration test utterances (1-8 s). We also present improved results on long-duration tasks. In addition, this work discusses a novel self-attention mechanism. On the VOiCES far-field corpus, the combination of the proposed techniques achieves relative equal error rate (EER) improvements over our baseline system of 7.0% for extremely short and 8.2% for full-duration test utterances.
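
The abstract reports its gains as relative improvements in equal error rate (EER). As a rough illustration of what that metric means, the sketch below approximates EER by sweeping a decision threshold over verification scores and then computes a relative improvement between two systems. The function names and all scores and EER values are hypothetical and chosen only for illustration; none of them come from the paper, and this is not the authors' evaluation code.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by trying every observed score as a threshold.

    target_scores: scores of same-speaker trials (higher = more similar).
    nontarget_scores: scores of different-speaker trials.
    """
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_improvement(baseline_eer, new_eer):
    """Fraction by which new_eer is lower than baseline_eer."""
    return (baseline_eer - new_eer) / baseline_eer


# Made-up scores, for illustration only.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))

# Made-up EER values (in percent): a drop from 10.0 to 9.3 is a 7% relative gain.
print(relative_improvement(10.0, 9.3))
```

A relative improvement is measured against the baseline's own error rate, so a 7.0% relative gain is a much smaller change than a 7.0 percentage-point drop in EER.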
