Paper Title
Implicit Filter-and-sum Network for Multi-channel Speech Separation
Paper Authors
Paper Abstract
Various neural network architectures have been proposed in recent years for the task of multi-channel speech separation. Among them, the filter-and-sum network (FaSNet) performs end-to-end time-domain filter-and-sum beamforming and has been shown to be effective in both ad-hoc and fixed microphone array geometries. In this paper, we investigate multiple ways to improve the performance of FaSNet. From the problem formulation perspective, we change the explicit time-domain filter-and-sum operation, which involves all the microphones, into an implicit filter-and-sum operation in the latent space of only the reference microphone. The filter-and-sum operation is applied on a context around the frame to be separated. This allows the problem formulation to better match the objective of end-to-end separation. From the feature extraction perspective, we replace the sample-level normalized cross correlation (NCC) features with feature-level NCC (fNCC) features. This makes the model better match the implicit filter-and-sum formulation. Experimental results on both ad-hoc and fixed microphone array geometries show that the proposed modification to FaSNet, which we refer to as iFaSNet, significantly outperforms the benchmark FaSNet across all conditions with comparable model complexity.
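As a rough illustration of the two baseline operations the abstract refers to, the following minimal sketch shows an explicit time-domain filter-and-sum and a sample-level NCC feature. It is not the authors' implementation: the function names `filter_and_sum` and `ncc`, the `eps` constant, and the toy signals are assumptions made for illustration, and in the actual networks the filters are estimated by a neural network rather than given. The feature-level variant (fNCC) would presumably apply the same cosine similarity to encoder feature vectors instead of raw samples, which this sketch does not cover.

```python
import math


def filter_and_sum(channels, filters):
    """Slide each channel's FIR filter over that channel (sliding dot
    product, no padding) and sum the filtered outputs across channels."""
    taps = len(filters[0])
    out_len = len(channels[0]) - taps + 1
    out = [0.0] * out_len
    for x, h in zip(channels, filters):
        for n in range(out_len):
            out[n] += sum(h[k] * x[n + k] for k in range(taps))
    return out


def ncc(frame, context, eps=1e-8):
    """Cosine similarity between `frame` and every same-length slice of
    `context`; `eps` (an assumed constant) avoids division by zero."""
    length = len(frame)
    frame_norm = math.sqrt(sum(v * v for v in frame))
    feats = []
    for j in range(len(context) - length + 1):
        seg = context[j:j + length]
        seg_norm = math.sqrt(sum(v * v for v in seg))
        dot = sum(a * b for a, b in zip(frame, seg))
        feats.append(dot / (frame_norm * seg_norm + eps))
    return feats


# Toy two-microphone example with hand-picked one-sample-delay filters.
mixed = filter_and_sum([[1, 2, 3, 4], [4, 3, 2, 1]], [[1, 0], [0, 1]])
print(mixed)  # [4.0, 4.0, 4.0]

# The NCC peaks at the lag where the context slice lines up with the frame.
feats = ncc([1.0, 2.0, 1.0], [0.0, 1.0, 2.0, 1.0, 0.0])
print(feats.index(max(feats)))  # 1
```

The NCC vector has one entry per candidate lag, so its peak position gives a coarse estimate of the delay between the reference frame and the other microphone's context window.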