Enhancing End-to-End Multi-channel Speech Separation via Spatial Feature Learning

Paper title

Enhancing End-to-End Multi-channel Speech Separation via Spatial Feature Learning

Authors

Gu, Rongzhi; Zhang, Shi-Xiong; Chen, Lianwu; Xu, Yong; Yu, Meng; Su, Dan; Zou, Yuexian; Yu, Dong

Abstract

Hand-crafted spatial features (e.g., inter-channel phase difference, IPD) play a fundamental role in recent deep-learning-based multi-channel speech separation (MCSS) methods. However, these manually designed spatial features are hard to incorporate into an end-to-end optimized MCSS framework. In this work, we propose an integrated architecture for learning spatial features directly from multi-channel speech waveforms within an end-to-end speech separation framework. In this architecture, time-domain filters spanning signal channels are trained to perform adaptive spatial filtering. These filters are implemented by a 2D convolution (conv2d) layer, and their parameters are optimized using a speech separation objective function in a purely data-driven fashion. Furthermore, inspired by the IPD formulation, we design a conv2d kernel to compute inter-channel convolution differences (ICDs), which are expected to provide spatial cues that help distinguish directional sources. Evaluation results on a simulated multi-channel reverberant WSJ0 2-mix dataset demonstrate that our proposed ICD-based MCSS model improves the overall signal-to-distortion ratio by 10.4% over the IPD-based MCSS model.
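
The ICD idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the sketch assumes that an ICD for a microphone pair is the difference between the outputs of the same time-domain filter applied to each channel, and that this is equivalent to a single kernel spanning both channels with rows `[k, -k]`. The names `frame_signal`, `icd`, and `icd_as_2d_kernel`, as well as the filter count, filter length, and hop size, are hypothetical choices made for illustration.

```python
import numpy as np

def frame_signal(x, win, hop):
    """Slice a 1-D signal into overlapping frames of shape (n_frames, win)."""
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def icd(y, pair, kernels, hop):
    """Assumed ICD for one microphone pair: filter both channels, subtract.

    y:       (C, T) multi-channel waveform
    pair:    (m1, m2) channel indices
    kernels: (N, L) time-domain filters shared by both channels
    returns: (N, n_frames)
    """
    m1, m2 = pair
    win = kernels.shape[1]
    f1 = frame_signal(y[m1], win, hop)  # (n_frames, L)
    f2 = frame_signal(y[m2], win, hop)  # (n_frames, L)
    # Strided cross-correlation of each channel with every filter.
    return kernels @ f1.T - kernels @ f2.T

def icd_as_2d_kernel(y, pair, kernels, hop):
    """Same quantity computed with one kernel spanning two channels: [k, -k]."""
    m1, m2 = pair
    win = kernels.shape[1]
    stacked = np.stack(
        [frame_signal(y[m1], win, hop), frame_signal(y[m2], win, hop)], axis=1
    )  # (n_frames, 2, L)
    k2d = np.stack([kernels, -kernels], axis=1)  # (N, 2, L)
    return np.einsum("ncl,fcl->nf", k2d, stacked)

# Toy data standing in for a 4-microphone recording (random, not real speech).
rng = np.random.default_rng(0)
y = rng.standard_normal((4, 1600))
kernels = rng.standard_normal((8, 40))  # in a trained model these would be learned

a = icd(y, (0, 2), kernels, hop=20)
b = icd_as_2d_kernel(y, (0, 2), kernels, hop=20)
```

Under these assumptions the two functions return the same array, and feeding the same signal to both channels of a pair yields an all-zero ICD, which matches the intuition that the feature responds only to differences between microphones. In an end-to-end system the filters would be trained jointly with the separation network rather than drawn at random.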
