Paper title
Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation
Paper authors
Paper abstract
Most automatic speech processing systems register degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a neural network jointly trained to extract speech/non-speech segments, speech-to-noise ratios, and C50 room acoustics from single-channel recordings. Brouhaha is trained using a data-driven approach in which noisy and reverberant audio segments are synthesized. We first evaluate its performance and demonstrate that the proposed multi-task regime is beneficial. We then present two scenarios illustrating how Brouhaha can be used on naturally noisy and reverberant data: 1) to investigate the errors made by a speaker diarization model (pyannote.audio); and 2) to assess the reliability of an automatic speech recognition model (Whisper from OpenAI). Both our pipeline and a pretrained model are open source and shared with the speech community.
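The abstract names two quantities without defining them. As a minimal sketch of the standard textbook definitions (not Brouhaha's own code or training pipeline), the speech-to-noise ratio is the ratio of mean speech power to mean noise power in dB, and C50 is the ratio of the energy in the first 50 ms of a room impulse response to the energy after 50 ms, also in dB. The function names `snr_db` and `c50_db` are hypothetical, and the sketch assumes the impulse response starts at the arrival of the direct sound.

```python
import math


def snr_db(speech, noise):
    """Speech-to-noise ratio in dB from separate speech and noise sample sequences."""
    p_speech = sum(x * x for x in speech) / len(speech)  # mean speech power
    p_noise = sum(x * x for x in noise) / len(noise)     # mean noise power
    return 10.0 * math.log10(p_speech / p_noise)


def c50_db(impulse_response, sample_rate):
    """C50 in dB: early (first 50 ms) over late energy of a room impulse response.

    Assumption: index 0 of impulse_response is the direct-sound arrival.
    """
    split = int(round(0.050 * sample_rate))  # number of samples in 50 ms
    early = sum(h * h for h in impulse_response[:split])
    late = sum(h * h for h in impulse_response[split:])
    return 10.0 * math.log10(early / late)


# Toy usage with made-up signals (illustrative values only, not data from the paper).
toy_snr = snr_db([1.0] * 100, [0.1] * 100)         # approximately 20 dB
toy_c50 = c50_db([1.0] * 50 + [0.1] * 50, 1000)    # approximately 20 dB
```

A higher C50 indicates that more of the energy arrives early, that is, a less reverberant room. Both functions need the clean components (speech and noise separately, or the impulse response), which are known when noisy and reverberant segments are synthesized but not for natural recordings.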