Paper Title
Semi-supervised Time Domain Target Speaker Extraction with Attention
Paper Authors
Paper Abstract
In this work, we propose Exformer, a time-domain architecture for target speaker extraction. It consists of a pre-trained speaker embedder network and a separator network based on transformer encoder blocks. We study multiple methods to combine speaker information with the input mixture, and the resulting Exformer architecture obtains superior extraction performance compared to prior time-domain networks. Furthermore, we investigate a two-stage procedure that starts from a pre-trained supervised model and continues training it on mixtures without reference signals. Experimental results show that the proposed semi-supervised learning procedure improves the performance of the supervised baselines.
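The abstract does not say which methods were used to combine speaker information with the input mixture. As a rough illustration only, the sketch below shows three fusion strategies that are common in target speaker extraction generally: concatenation, addition, and element-wise scaling of per-frame mixture features by a fixed speaker embedding. The function names, the toy dimensions, and the choice of strategies are assumptions for illustration and are not taken from the paper.

```python
from typing import List

Frames = List[List[float]]  # T frames, each holding D features


def fuse_concat(frames: Frames, spk: List[float]) -> Frames:
    """Append the speaker embedding to every frame: (T, D) -> (T, D + E)."""
    return [frame + spk for frame in frames]


def fuse_add(frames: Frames, spk: List[float]) -> Frames:
    """Add the speaker embedding to every frame; requires E == D."""
    if any(len(frame) != len(spk) for frame in frames):
        raise ValueError("additive fusion needs matching dimensions")
    return [[x + s for x, s in zip(frame, spk)] for frame in frames]


def fuse_scale(frames: Frames, spk: List[float]) -> Frames:
    """Scale every frame element-wise by the embedding; requires E == D."""
    if any(len(frame) != len(spk) for frame in frames):
        raise ValueError("multiplicative fusion needs matching dimensions")
    return [[x * s for x, s in zip(frame, spk)] for frame in frames]


# Toy inputs (hypothetical): 3 encoded mixture frames with 2 features each,
# and a 2-dimensional speaker embedding from an enrollment utterance.
mixture_frames = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]]
speaker_embedding = [1.0, 2.0]

concat_out = fuse_concat(mixture_frames, speaker_embedding)
add_out = fuse_add(mixture_frames, speaker_embedding)
scale_out = fuse_scale(mixture_frames, speaker_embedding)
```

In a real system the fused frames would be passed to the separator network. Concatenation changes the feature dimension, so it is usually followed by a projection layer, whereas addition and scaling keep the dimension unchanged.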