Paper Title
Replay and Synthetic Speech Detection with Res2Net Architecture
Authors
Abstract
Existing approaches for replay and synthetic speech detection still generalize poorly to unseen spoofing attacks. This work proposes to leverage a novel model structure, Res2Net, to improve the generalizability of anti-spoofing countermeasures. Res2Net modifies the ResNet block to enable multiple feature scales. Specifically, it splits the feature maps within one block into multiple channel groups and designs a residual-like connection across the different channel groups. This connection increases the range of possible receptive fields, resulting in multiple feature scales. The multi-scale mechanism significantly improves the countermeasure's generalizability to unseen spoofing attacks. It also decreases the model size compared to ResNet-based models. Experimental results show that the Res2Net model consistently outperforms ResNet34 and ResNet50 by a large margin in both the physical access (PA) and logical access (LA) scenarios of the ASVspoof 2019 corpus. Moreover, integration with the squeeze-and-excitation (SE) block can further enhance performance. For feature engineering, we investigate the generalizability of Res2Net combined with different acoustic features, and observe that the constant-Q transform (CQT) achieves the most promising performance in both the PA and LA scenarios. Our best single system outperforms other state-of-the-art single systems in both PA and LA of the ASVspoof 2019 corpus.
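To make the channel-group mechanism concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes the commonly described Res2Net formulation, which the abstract does not spell out: the first group passes through unchanged, the second is filtered, and each later group is filtered after the previous group's output is added to it. The names `conv3`, `res2net_groups`, and `support_width` are hypothetical, and a fixed 3-tap sum over 1-D lists stands in for a learned 3x3 convolution. Feeding an impulse into every group shows how far information spreads in each output, which mirrors the growth of the receptive field from group to group.

```python
from typing import Callable, List, Optional, Sequence

Signal = List[float]


def conv3(x: Sequence[float]) -> Signal:
    """Toy 3-tap sum filter with zero padding.

    Assumption: a stand-in for a learned 3x3 convolution.
    """
    padded = [0.0, *x, 0.0]
    return [padded[i] + padded[i + 1] + padded[i + 2] for i in range(len(x))]


def res2net_groups(
    groups: Sequence[Signal],
    conv: Callable[[Sequence[float]], Signal] = conv3,
) -> List[Signal]:
    """Apply the residual-like connection across channel groups.

    Assumed formulation: y1 = x1, y2 = conv(x2), yi = conv(xi + y(i-1)).
    """
    outputs: List[Signal] = []
    prev: Optional[Signal] = None
    for i, x in enumerate(groups):
        if i == 0:
            y = list(x)  # first group: identity, no filtering
        elif i == 1:
            y = conv(x)  # second group: filtered once
        else:
            # later groups: add the previous group's output, then filter
            y = conv([a + b for a, b in zip(x, prev)])
        outputs.append(y)
        prev = y
    return outputs


def support_width(y: Sequence[float]) -> int:
    """Width of the non-zero region of a signal."""
    nonzero = [i for i, v in enumerate(y) if v != 0]
    return nonzero[-1] - nonzero[0] + 1 if nonzero else 0


if __name__ == "__main__":
    impulse = [0.0] * 9
    impulse[4] = 1.0
    # Four channel groups (scale = 4), each holding the same impulse.
    outs = res2net_groups([list(impulse) for _ in range(4)])
    print([support_width(y) for y in outs])
```

In this toy setting each successive group covers a wider span than the one before it, so concatenating the group outputs yields features at several scales within a single block.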