Paper Title
DT-SV: A Transformer-based Time-domain Approach for Speaker Verification
Paper Authors
Paper Abstract
Speaker verification (SV) aims to determine whether the speaker identity of a test utterance matches that of a reference utterance. In the past few years, extracting speaker embeddings with deep neural networks has become the mainstream approach for SV systems. Recently, various attention mechanisms and Transformer networks have been widely explored in the SV field. However, directly applying the original Transformer to SV may waste frame-level information in the output features, which can limit the capacity and discriminative power of speaker embeddings. Therefore, we propose an approach that derives utterance-level speaker embeddings from a Transformer architecture by using a novel loss function, named diffluence loss, to integrate feature information from different Transformer layers. The diffluence loss aggregates frame-level features into an utterance-level representation and can be conveniently integrated into the Transformer. In addition, we introduce a learnable mel-fbank energy feature extractor, named the time-domain feature extractor, which computes mel-fbank features more precisely and efficiently than the standard mel-fbank extractor. Combining the diffluence loss and the time-domain feature extractor, we propose a novel Transformer-based time-domain SV model (DT-SV) with faster training speed and higher accuracy. Experiments show that the proposed model achieves better performance than other models.
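The abstract does not give implementation details for the time-domain feature extractor, so the following PyTorch sketch only illustrates the general idea of a learnable mel-fbank energy front-end operating directly on raw waveforms. The class name `LearnableMelFbank`, its hyperparameters, and the choice of initializing a learnable mel projection from a standard mel filterbank are illustrative assumptions, not the authors' actual design.

```python
import torch
import torch.nn as nn
import torchaudio


class LearnableMelFbank(nn.Module):
    """Hypothetical sketch of a learnable log mel-fbank energy front-end
    on raw waveforms; not the exact extractor described in the paper."""

    def __init__(self, sample_rate=16000, n_fft=400, hop=160, n_mels=80):
        super().__init__()
        # Framing, windowing, and DFT realized as a fixed power spectrogram;
        # only the mel projection below is made learnable.
        self.spec = torchaudio.transforms.Spectrogram(
            n_fft=n_fft, hop_length=hop, power=2.0)
        # Initialize the learnable projection from a standard mel filterbank
        # so training starts close to the classic mel-fbank features.
        mel_init = torchaudio.functional.melscale_fbanks(
            n_freqs=n_fft // 2 + 1, f_min=0.0, f_max=sample_rate / 2,
            n_mels=n_mels, sample_rate=sample_rate)
        self.mel_weights = nn.Parameter(mel_init)  # (n_freqs, n_mels)

    def forward(self, wav):                      # wav: (batch, samples)
        power = self.spec(wav)                   # (batch, n_freqs, frames)
        mel = torch.matmul(power.transpose(1, 2), self.mel_weights)
        return torch.log(mel + 1e-6)             # (batch, frames, n_mels)


# Example usage on a batch of two one-second 16 kHz waveforms.
feats = LearnableMelFbank()(torch.randn(2, 16000))
```

Because the mel projection is a trainable parameter, it can be updated jointly with the rest of the SV model, which is one plausible way a time-domain front-end could be made learnable end-to-end.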