Paper Title
DBT-Net: Dual-branch federative magnitude and phase estimation with attention-in-attention transformer for monaural speech enhancement
Paper Authors
Paper Abstract
The decoupling-style concept has begun to gain traction in the speech enhancement area: it decomposes the original complex spectrum estimation task into multiple easier sub-tasks (i.e., magnitude-only recovery and residual complex spectrum estimation), yielding better performance and easier interpretability. In this paper, we propose a dual-branch federative magnitude and phase estimation framework, dubbed DBT-Net, for monaural speech enhancement, aiming at recovering the coarse- and fine-grained regions of the overall spectrum in parallel. From a complementary perspective, the magnitude estimation branch is designed to filter out dominant noise components in the magnitude domain, while the complex spectrum purification branch is elaborately designed to inpaint the missing spectral details and implicitly estimate the phase information in the complex-valued spectral domain. To facilitate information flow between the branches, interaction modules are introduced to leverage features learned from one branch to suppress the undesired parts and recover the missing components of the other. Instead of adopting conventional RNNs and temporal convolutional networks for sequence modeling, we employ a novel attention-in-attention transformer-based network within each branch for better feature learning. More specifically, it is composed of several adaptive spectro-temporal attention transformer-based modules and an adaptive hierarchical attention module, aiming to capture long-term time-frequency dependencies and further aggregate intermediate hierarchical contextual information. Comprehensive evaluations on the WSJ0-SI84 + DNS-Challenge and VoiceBank + DEMAND datasets demonstrate that the proposed approach consistently outperforms previous advanced systems and yields state-of-the-art performance in terms of speech quality and intelligibility.
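To illustrate the cross-branch gating idea described above, here is a minimal NumPy sketch of one plausible form of an interaction module: features from one branch are projected through a learnable weight and squashed into a sigmoid gate, which then modulates the other branch's features via a residual connection. The function name `interaction`, the projection weight `w`, and the residual-gating formulation are illustrative assumptions for exposition, not the exact operations used in DBT-Net.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def interaction(own, other, w):
    """Hypothetical interaction module (illustrative, not the paper's exact design):
    a gate derived from the *other* branch's features suppresses undesired parts
    and recovers missing components of *this* branch's features."""
    gate = sigmoid(other @ w)   # project the other branch's features into a (0, 1) mask
    return own + own * gate     # residual gating of this branch's features

# Toy features for the two branches: (time frames x feature channels)
rng = np.random.default_rng(0)
mag_feat = rng.standard_normal((10, 8))    # magnitude-estimation branch features
cplx_feat = rng.standard_normal((10, 8))   # complex-spectrum-purification branch features
w = 0.1 * rng.standard_normal((8, 8))      # stand-in for a learned projection

fused = interaction(mag_feat, cplx_feat, w)
print(fused.shape)  # (10, 8): same shape as the input branch features
```

In a trained network `w` would be learned jointly with both branches; the sketch only shows how information from one branch can gate the other without changing its tensor shape.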