Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation

Paper title

Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation

Authors

Marvin Lavechin, Marianne Métais, Hadrien Titeux, Alodie Boissonnet, Jade Copet, Morgane Rivière, Elika Bergelson, Alejandrina Cristia, Emmanuel Dupoux, Hervé Bredin

Abstract

Most automatic speech processing systems register degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a neural network jointly trained to extract speech/non-speech segments, speech-to-noise ratios, and C50 room acoustics from single-channel recordings. Brouhaha is trained using a data-driven approach in which noisy and reverberant audio segments are synthesized. We first evaluate its performance and demonstrate that the proposed multi-task regime is beneficial. We then present two scenarios illustrating how Brouhaha can be used on naturally noisy and reverberant data: 1) to investigate the errors made by a speaker diarization model (pyannote.audio); and 2) to assess the reliability of an automatic speech recognition model (Whisper from OpenAI). Both our pipeline and a pretrained model are open source and shared with the speech community.
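
For readers unfamiliar with the two regression targets named in the abstract, the following is a minimal sketch of their textbook definitions: speech-to-noise ratio as a power ratio in dB, and C50 as the ratio of early (first 50 ms) to late energy in a room impulse response, also in dB. This is not Brouhaha's code or API. The function names, the toy signals, and the assumption that the impulse response starts at the direct-path arrival are all illustrative.

```python
import math


def snr_db(speech, noise):
    """Speech-to-noise ratio in dB from two sample sequences (hypothetical helper)."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(speech_power / noise_power)


def c50_db(impulse_response, sample_rate):
    """Clarity index C50 in dB: energy in the first 50 ms over the remaining energy."""
    # Assumption: index 0 of the impulse response is the direct-path arrival.
    split = int(round(0.050 * sample_rate))
    early = sum(h * h for h in impulse_response[:split])
    late = sum(h * h for h in impulse_response[split:])
    return 10.0 * math.log10(early / late)


# Toy data: an exponentially decaying impulse response sampled at 1 kHz,
# and a unit-amplitude "speech" signal against noise ten times smaller.
sample_rate = 1000
rir = [math.exp(-n / 20.0) for n in range(500)]
speech = [1.0, -1.0] * 50
noise = [0.1, -0.1] * 50

print("C50 (dB):", round(c50_db(rir, sample_rate), 2))
print("SNR (dB):", round(snr_db(speech, noise), 2))
```

A higher C50 indicates less reverberant speech, and a higher SNR indicates less noise. Brouhaha predicts both quantities directly from the recording, where neither the clean speech nor the impulse response is available.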
