Paper Title
Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation
Paper Authors
Paper Abstract
We propose multi-microphone complex spectral mapping, a simple way of applying deep learning for time-varying non-linear beamforming, for speaker separation in reverberant conditions. We aim at both speaker separation and dereverberation. Our study first investigates offline utterance-wise speaker separation and then extends to block-online continuous speech separation (CSS). Assuming a fixed array geometry between training and testing, we train deep neural networks (DNN) to predict the real and imaginary (RI) components of target speech at a reference microphone from the RI components of multiple microphones. We then integrate multi-microphone complex spectral mapping with minimum variance distortionless response (MVDR) beamforming and post-filtering to further improve separation, and combine it with frame-level speaker counting for block-online CSS. Although our system is trained on simulated room impulse responses (RIR) based on a fixed number of microphones arranged in a given geometry, it generalizes well to a real array with the same geometry. State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
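The abstract mentions integrating the DNN outputs with MVDR beamforming. As a rough illustration of that step, the sketch below computes MVDR weights per frequency bin from estimated target and noise spatial covariance matrices, using the common reference-channel (Souden-style) formulation. This is a generic textbook formulation, not the paper's exact implementation; the function name, shapes, and the small stabilizing constant are assumptions for illustration.

```python
import numpy as np

def mvdr_weights(target_scm, noise_scm, ref_mic=0):
    """MVDR beamforming weights per frequency bin (illustrative sketch).

    target_scm, noise_scm: (F, M, M) spatial covariance matrices,
    e.g. estimated from DNN-predicted target/noise spectra.
    Reference-channel formulation:
        w(f) = (Phi_n^{-1} Phi_s) u_ref / trace(Phi_n^{-1} Phi_s)
    Returns w of shape (F, M).
    """
    F, M, _ = target_scm.shape
    u = np.zeros(M)
    u[ref_mic] = 1.0                                     # one-hot reference selector
    num = np.linalg.solve(noise_scm, target_scm)         # (F, M, M): Phi_n^{-1} Phi_s
    denom = np.trace(num, axis1=1, axis2=2)[:, None]     # (F, 1)
    return (num @ u) / np.maximum(denom.real, 1e-8)      # small floor for stability

# The enhanced spectrum at each frame is then w(f)^H y(f, t)
# for the multichannel STFT vector y(f, t) of shape (M,).
```

With a rank-1 target covariance built from a steering vector and identity noise covariance, this beamformer is distortionless toward the reference channel: the output of a target-only input equals the target signal at the reference microphone.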