Paper Title
Mask-based Neural Beamforming for Moving Speakers with Self-Attention-based Tracking
Paper Authors
Paper Abstract
Beamforming is a powerful tool designed to enhance speech signals from the direction of a target source. Computing the beamforming filter requires estimating spatial covariance matrices (SCMs) of the source and noise signals. Time-frequency masks are often used to compute these SCMs. Most studies of mask-based beamforming have assumed that the sources do not move. However, sources often move in practice, which causes performance degradation. In this paper, we address the problem of mask-based beamforming for moving sources. We first review classical approaches to tracking a moving source, which perform online or blockwise computation of the SCMs. We show that these approaches can be interpreted as computing a sum of instantaneous SCMs weighted by attention weights. These weights indicate which time frames of the signal to consider in the SCM computation. Online or blockwise computation assumes a heuristic and deterministic way of computing these attention weights that, although simple, may not result in optimal performance. We thus introduce a learning-based framework that computes optimal attention weights for beamforming. We achieve this using a neural network implemented with self-attention layers. We show experimentally that our proposed framework can greatly improve beamforming performance in moving source situations while maintaining high performance in non-moving situations, thus enabling the development of mask-based beamformers robust to source movements.
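The abstract's key observation is that both offline (batch) and online/blockwise SCM estimation can be written as a sum of per-frame instantaneous SCMs weighted by attention weights. The sketch below illustrates that unified view in NumPy; the array shapes, the mask, and the weighting schemes are illustrative assumptions, not the paper's actual implementation (which learns the weights with a self-attention network).

```python
import numpy as np

def masked_scm(X, mask, weights):
    """Attention-weighted sum of mask-weighted instantaneous SCMs.

    X:       (T, F, M) multichannel STFT of the mixture (T frames,
             F frequency bins, M microphones).
    mask:    (T, F) time-frequency mask for the target (or noise) source.
    weights: (T,) attention weights over time frames. Uniform weights
             recover the batch (offline) SCM; exponentially decaying
             weights mimic online/recursive tracking of a moving source.
    Returns: (F, M, M) SCM per frequency bin.
    """
    # Instantaneous SCM at frame t, bin f: mask[t, f] * x x^H
    inst = mask[..., None, None] * (X[..., :, None] * X[..., None, :].conj())
    w = weights / (weights.sum() + 1e-12)   # normalize the attention weights
    return np.einsum('t,tfij->fij', w, inst)

# Toy example comparing two heuristic weightings mentioned in the abstract.
rng = np.random.default_rng(0)
T, F, M = 50, 4, 3
X = rng.standard_normal((T, F, M)) + 1j * rng.standard_normal((T, F, M))
mask = rng.uniform(size=(T, F))

uniform = np.ones(T)                        # offline: all frames count equally
alpha = 0.9
exponential = alpha ** np.arange(T)[::-1]   # online-style: emphasize recent frames

Phi_batch = masked_scm(X, mask, uniform)
Phi_online = masked_scm(X, mask, exponential)
```

In the paper's framework, the fixed `weights` vector is replaced by weights predicted per frame by self-attention layers, so the network can, e.g., down-weight frames recorded before the source moved.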