Paper Title

BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

Paper Authors

Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen, Xu Tan, Danilo Mandic, Lei He, Xiang-Yang Li, Tao Qin, Sheng Zhao, Tie-Yan Liu

Paper Abstract

Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio in the real world, synthesizing it from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberations and head/ear-related filtrations, which, however, are difficult to simulate accurately with traditional digital signal processing. In this paper, we formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part that is shared by the left and right channels and a specific part that differs in each channel. Accordingly, we propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively. Specifically, in the first stage, the common information of the binaural audio is generated with a single-channel diffusion model conditioned on the mono audio, based on which the binaural audio is generated by a two-channel diffusion model in the second stage. Combining this novel perspective of two-stage synthesis with advanced generative models (i.e., diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples. Experimental results show that, on a benchmark dataset, BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics (Wave L2: 0.128 vs. 0.157, MOS: 3.80 vs. 3.61). The generated audio samples (https://speechresearch.github.io/binauralgrad) and code (https://github.com/microsoft/NeuralSpeech/tree/master/BinauralGrad) are available online.
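The two-stage pipeline described above can be sketched in structure only: stage 1 runs a single-channel diffusion sampler conditioned on the mono audio to produce the common part, and stage 2 runs a two-channel sampler conditioned on that common part to produce the binaural output. The sketch below is a minimal, hypothetical illustration, not the paper's actual implementation: the noise-prediction networks are stubbed with placeholder functions, and only the generic DDPM ancestral-sampling update is real.

```python
# Hedged sketch of a two-stage conditional diffusion pipeline in the spirit of
# BinauralGrad (structure only; all model calls below are placeholder stubs,
# not the trained networks from the paper).
import numpy as np

def ddpm_reverse(eps_fn, shape, betas, rng):
    """Generic DDPM ancestral sampler; eps_fn(x, t) predicts the noise."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_fn(x, t)
        # Standard DDPM posterior-mean update.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Add sampling noise at all but the final step.
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

def synthesize_binaural(mono, betas, rng):
    # Stage 1: single-channel model conditioned on the mono audio yields the
    # common part. The lambda is a stub that merely pulls samples toward the
    # conditioning signal; a real model would be a trained conditional network.
    common = ddpm_reverse(lambda x, t: x - mono, mono.shape, betas, rng)
    # Stage 2: two-channel model conditioned on the common part yields the
    # left/right binaural channels.
    cond = np.stack([common, common])  # broadcast the condition to 2 channels
    binaural = ddpm_reverse(lambda x, t: x - cond, (2,) + mono.shape, betas, rng)
    return binaural

rng = np.random.default_rng(0)
mono = np.sin(np.linspace(0, 2 * np.pi, 256))  # toy mono waveform
betas = np.linspace(1e-4, 0.02, 50)            # short toy noise schedule
out = synthesize_binaural(mono, betas, rng)
print(out.shape)  # (2, 256): left and right channels
```

The key design point the sketch mirrors is the conditioning chain: the second-stage sampler never sees the raw mono input directly, only the stage-1 common part, which is what lets each stage specialize.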
