Paper Title
Rule-embedded network for audio-visual voice activity detection in live musical video streams
Paper Authors
Paper Abstract
Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-based VAD struggles to focus on the target voice in noisy environments. With the help of visual information, this paper proposes a rule-embedded network that fuses the audio-visual (A-V) inputs to help the model better detect the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and to use visual representations as a mask that filters out the information of non-target sounds. Experiments show that: 1) with the cross-modal fusion guided by the proposed rule, the detection result of the A-V branch outperforms that of the audio branch; 2) the bi-modal model far outperforms audio-only models, indicating that incorporating both audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.
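
The abstract does not include code, but the embedded rule it describes, using visual representations as a mask that gates audio representations, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the class name RuleEmbeddedAVVAD, the layer types, and the feature dimensions are hypothetical and do not reproduce the paper's actual architecture; only the gating idea (a visual mask suppressing non-target sound) follows the abstract.

```python
import torch
import torch.nn as nn


class RuleEmbeddedAVVAD(nn.Module):
    """Minimal sketch of rule-embedded audio-visual fusion for VAD.

    Hypothetical architecture: the rule is realized by having the visual
    branch predict a frame-level scalar mask in (0, 1) that multiplies the
    audio representation, suppressing frames where the visual evidence
    indicates the anchor is not speaking (non-target sound).
    """

    def __init__(self, audio_dim=64, visual_dim=512, hidden=128):
        super().__init__()
        # Audio branch: frame-level acoustic features -> hidden representation.
        self.audio_net = nn.GRU(audio_dim, hidden, batch_first=True)
        # Visual branch: per-frame visual embeddings -> scalar mask in (0, 1).
        self.visual_net = nn.Sequential(
            nn.Linear(visual_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )
        # A-V classifier on the masked audio representation.
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, frames, audio_dim)
        # visual_feats: (batch, frames, visual_dim)
        audio_repr, _ = self.audio_net(audio_feats)   # (B, T, hidden)
        mask = self.visual_net(visual_feats)          # (B, T, 1)
        fused = audio_repr * mask                     # rule: visual mask filters non-target sound
        # Frame-level probability that the target voice is active.
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)  # (B, T)


# Usage with dummy inputs: 100 synchronized frames of audio and visual features.
model = RuleEmbeddedAVVAD()
audio = torch.randn(2, 100, 64)
video = torch.randn(2, 100, 512)
probs = model(audio, video)  # (2, 100) frame-level target-voice probabilities
```

The multiplicative mask is one simple way to encode the paper's rule: a near-zero visual score zeroes out the audio representation for that frame, so background music or non-anchor voices cannot drive the frame-level prediction regardless of their acoustic energy.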