Neural Fourier Shift for Binaural Speech Rendering

Paper Title

Neural Fourier Shift for Binaural Speech Rendering

Authors

Lee, Jin Woo; Lee, Kyogu

Abstract

We present a neural network for rendering binaural speech from given monaural audio and the position and orientation of the source. Most previous work has focused on synthesizing binaural speech by conditioning on the positions and orientations in the feature space of convolutional neural networks. These synthesis approaches are powerful in estimating the target binaural speech even for in-the-wild data, but they are difficult to generalize to rendering audio from out-of-distribution domains. To alleviate this, we propose Neural Fourier Shift (NFS), a novel network architecture that enables binaural speech rendering in the Fourier space. Specifically, utilizing a geometric time delay based on the distance between the source and the receiver, NFS is trained to predict the delays and scales of various early reflections. NFS is efficient in both memory and computational cost, is interpretable, and operates independently of the source domain by design. Experimental results show that NFS performs comparably to previous studies on the benchmark dataset while using 25 times less memory and 6 times fewer computations.
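The core idea the abstract relies on, delaying a signal in the Fourier domain by a geometric time delay, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the fixed speed of sound, and the `scale` argument are assumptions made for illustration, and the learned part of NFS (a network that predicts delays and scales of early reflections) is not shown. The sketch only applies the Fourier shift theorem, in which a delay of d samples corresponds to multiplying the spectrum by exp(-2*pi*j*f*d).

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def geometric_delay(source_pos, receiver_pos, sample_rate):
    """Propagation delay in samples from source to receiver (hypothetical helper)."""
    diff = np.asarray(source_pos, dtype=float) - np.asarray(receiver_pos, dtype=float)
    return np.linalg.norm(diff) / SPEED_OF_SOUND * sample_rate


def fourier_shift(signal, delay_samples, scale=1.0):
    """Delay a real signal by a (possibly fractional) number of samples.

    Uses the shift theorem; the shift is circular, so pad the signal
    with zeros first if wrap-around matters.
    """
    n = len(signal)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n)  # frequencies in cycles per sample
    phase = np.exp(-2j * np.pi * freqs * delay_samples)
    return scale * np.fft.irfft(spectrum * phase, n=n)


def render_binaural(mono, source_pos, left_ear, right_ear, sample_rate):
    """Return a (2, n) array: the mono signal delayed separately for each ear."""
    return np.stack([
        fourier_shift(mono, geometric_delay(source_pos, ear, sample_rate))
        for ear in (left_ear, right_ear)
    ])
```

Because the delay is applied as a phase ramp rather than by shifting samples, it can take non-integer values, which is what allows a delay to be treated as a continuous quantity that a model could predict. In this sketch the `scale` argument is only a placeholder for a per-path gain.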
