Paper Title
Rule-embedded network for audio-visual voice activity detection in live musical video streams
Paper Authors
Paper Abstract
Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-based VAD struggles to focus on the target voice in noisy environments. With the help of visual information, this paper proposes a rule-embedded network that fuses the audio-visual (A-V) inputs to help the model better detect the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and to use visual representations as a mask that filters out the information of non-target sounds. Experiments show that: 1) with the cross-modal fusion guided by the proposed rule, the detection result of the A-V branch outperforms that of the audio branch; 2) the bi-modal model far outperforms audio-only models, indicating that incorporating both audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.
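
The abstract does not include code, but the embedded rule it describes, using visual representations as a mask that gates audio representations, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the class name RuleEmbeddedAVVAD, the layer types, and the feature dimensions are hypothetical and do not reproduce the paper's actual architecture; only the gating idea (a visual mask suppressing non-target sound) follows the abstract.

```python
import torch
import torch.nn as nn


class RuleEmbeddedAVVAD(nn.Module):
    """Minimal sketch of rule-embedded audio-visual fusion for VAD.

    Hypothetical architecture: the rule is realized by having the visual
    branch predict a frame-level scalar mask in (0, 1) that multiplies the
    audio representation, suppressing frames where the visual evidence
    indicates the anchor is not speaking (non-target sound).
    """

    def __init__(self, audio_dim=64, visual_dim=512, hidden=128):
        super().__init__()
        # Audio branch: frame-level acoustic features -> hidden representation.
        self.audio_net = nn.GRU(audio_dim, hidden, batch_first=True)
        # Visual branch: per-frame visual embeddings -> scalar mask in (0, 1).
        self.visual_net = nn.Sequential(
            nn.Linear(visual_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )
        # A-V classifier on the masked audio representation.
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, frames, audio_dim)
        # visual_feats: (batch, frames, visual_dim)
        audio_repr, _ = self.audio_net(audio_feats)   # (B, T, hidden)
        mask = self.visual_net(visual_feats)          # (B, T, 1)
        fused = audio_repr * mask                     # rule: visual mask filters non-target sound
        # Frame-level probability that the target voice is active.
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)  # (B, T)


# Usage with dummy inputs: 100 synchronized frames of audio and visual features.
model = RuleEmbeddedAVVAD()
audio = torch.randn(2, 100, 64)
video = torch.randn(2, 100, 512)
probs = model(audio, video)  # (2, 100) frame-level target-voice probabilities
```

The multiplicative mask is one simple way to encode the paper's rule: a near-zero visual score zeroes out the audio representation for that frame, so background music or non-anchor voices cannot drive the frame-level prediction regardless of their acoustic energy.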