Title

Double Multi-Head Attention for Speaker Verification

Authors

Miquel India, Pooyan Safari, Javier Hernando

Abstract

Most state-of-the-art Deep Learning systems for speaker verification are based on speaker embedding extractors. These architectures are commonly composed of a feature extractor front-end together with a pooling layer to encode variable-length utterances into fixed-length speaker vectors. In this paper we present Double Multi-Head Attention pooling, which extends our previous approach based on Self Multi-Head Attention. An additional self-attention layer is added to the pooling layer that summarizes the context vectors produced by Multi-Head Attention into a unique speaker representation. This method enhances the pooling mechanism by giving weights to the information captured by each head and results in more discriminative speaker embeddings. We have evaluated our approach on the VoxCeleb2 dataset. Our results show 6.09% and 5.23% relative improvement in terms of EER compared to Self Attention pooling and Self Multi-Head Attention, respectively. According to the obtained results, Double Multi-Head Attention has shown to be an excellent approach to efficiently select the most relevant features captured by the CNN-based front-end from the speech signal.
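The two-level pooling described in the abstract can be sketched as follows: a first self-attention pass weights frames within each head to produce per-head context vectors, and a second self-attention pass weights those context vectors into a single utterance-level embedding. This is a minimal NumPy sketch under assumed shapes and parameterizations (`double_mha_pool`, `U`, `v` are illustrative names, not the paper's notation); the paper's exact architecture may differ.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def double_mha_pool(H, U, v):
    """Double Multi-Head Attention pooling (illustrative sketch).

    H: (T, d) frame-level features from the front-end.
    U: (n_heads, d_h) per-head frame-attention parameters, with d = n_heads * d_h.
    v: (d_h,) head-attention parameter for the second self-attention layer.
    Returns a (d_h,) fixed-length utterance representation.
    """
    n_heads, d_h = U.shape
    T, d = H.shape
    assert d == n_heads * d_h, "feature dim must split evenly across heads"
    heads = H.reshape(T, n_heads, d_h)

    # First level: self-attention over frames, independently per head.
    contexts = []
    for h in range(n_heads):
        scores = heads[:, h, :] @ U[h]           # (T,) frame scores
        alpha = softmax(scores)                  # frame-level weights
        contexts.append(alpha @ heads[:, h, :])  # (d_h,) head context vector
    C = np.stack(contexts)                       # (n_heads, d_h)

    # Second level: self-attention over the head context vectors,
    # weighting the information captured by each head.
    beta = softmax(C @ v)                        # head-level weights
    return beta @ C                              # (d_h,) speaker representation
```

The second softmax over `C @ v` is what distinguishes this from plain Self Multi-Head Attention pooling, which would simply concatenate or average the head context vectors instead of learning weights for them.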
