Improving Target Sound Extraction with Timestamp Information

Paper Title

Improving Target Sound Extraction with Timestamp Information

Authors

Helin Wang, Dongchao Yang, Chao Weng, Jianwei Yu, Yuexian Zou

Abstract

Target sound extraction (TSE) aims to extract the sound of a target sound event class from an audio mixture containing multiple sound events. Previous work has mainly focused on weakly labelled data, joint learning, and new classes. However, none of it considers the onset and offset times of the target sound event, which have been emphasized in auditory scene analysis. In this paper, we investigate using such timestamp information to help extract the target sound via a target sound detection network and a target-weighted time-frequency loss function. More specifically, we use the detection result of a target sound detection (TSD) network as additional information to guide the learning of the target sound extraction network. We also find that the TSE result can further improve the performance of the TSD network, and we therefore propose a mutual learning framework for target sound detection and extraction. In addition, a target-weighted time-frequency loss function is designed to pay more attention to the temporal regions of the target sound during training. Experimental results on synthesized data generated from the Freesound Datasets show that our proposed method can significantly improve TSE performance.
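
The abstract does not give the form of the target-weighted time-frequency loss, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes an L1 error between magnitude spectrograms and a per-frame weight that is larger between the onset and offset of the target sound. The function name `target_weighted_tf_loss`, the parameter `target_weight`, and the `(freq_bins, frames)` layout are all hypothetical choices made for this example.

```python
import numpy as np


def target_weighted_tf_loss(est_mag, ref_mag, onset, offset, target_weight=2.0):
    """Illustrative frame-weighted L1 loss on magnitude spectrograms.

    Assumptions (not taken from the paper): inputs have shape
    (freq_bins, frames); frames in [onset, offset) contain the target
    sound and get weight `target_weight`; all other frames get weight 1.
    """
    est_mag = np.asarray(est_mag, dtype=float)
    ref_mag = np.asarray(ref_mag, dtype=float)
    if est_mag.shape != ref_mag.shape:
        raise ValueError("est_mag and ref_mag must have the same shape")

    n_frames = est_mag.shape[1]
    weights = np.ones(n_frames)
    weights[onset:offset] = target_weight  # emphasize the target region

    # Mean absolute error over frequency, one value per frame.
    per_frame_err = np.abs(est_mag - ref_mag).mean(axis=0)

    # Weighted average over time.
    return float((weights * per_frame_err).sum() / weights.sum())


# Usage: the same amount of error costs more inside the target region.
est = np.zeros((4, 10))
ref_inside = np.zeros((4, 10))
ref_inside[:, 2:5] = 1.0   # error falls inside frames [2, 5)
ref_outside = np.zeros((4, 10))
ref_outside[:, 6:9] = 1.0  # error falls outside frames [2, 5)

loss_inside = target_weighted_tf_loss(est, ref_inside, onset=2, offset=5)
loss_outside = target_weighted_tf_loss(est, ref_outside, onset=2, offset=5)
```

With these toy inputs, `loss_inside` is larger than `loss_outside`, which is the behaviour the abstract describes: training pays more attention to the temporal region where the target sound is active.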
