Title
Semi-supervised learning using teacher-student models for vocal melody extraction
Authors
Abstract
The lack of labeled data is a major obstacle in many music information retrieval tasks such as melody extraction, where labeling is extremely laborious or costly. Semi-supervised learning (SSL) provides a solution to alleviate the issue by leveraging a large amount of unlabeled data. In this paper, we propose an SSL method using teacher-student models for vocal melody extraction. The teacher model is pre-trained with labeled data and guides the student model to make identical predictions given unlabeled input in a self-training setting. We examine three setups of teacher-student models with different data augmentation schemes and loss functions. Also, considering the scarcity of labeled data in the test phase, we artificially generate large-scale testing data with pitch labels from unlabeled data using an analysis-synthesis method. The results show that the SSL method significantly improves performance over supervised learning alone, and that the improvement depends on the teacher-student models, the size of the unlabeled data, the number of self-training iterations, and other training details. We also find that it is essential to ensure that the unlabeled audio contains vocal parts. Finally, we show that the proposed SSL method enables a baseline convolutional recurrent neural network model to achieve performance comparable to state-of-the-art methods.
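The core loop described in the abstract — a pre-trained teacher producing pseudo-labels on unlabeled data, a student trained to match them, and the student then serving as the next teacher — can be sketched as follows. This is a minimal illustrative toy in NumPy, not the paper's actual architecture: the linear "model", the cross-entropy update, and all sizes are stand-in assumptions.

```python
# Hypothetical sketch of teacher-student self-training on unlabeled data.
# The linear softmax "model" stands in for the paper's CRNN; pitch classes
# and feature dimensions are arbitrary toy values.
import numpy as np

rng = np.random.default_rng(0)

def predict(weights, x):
    """Toy model: softmax over pitch classes from a linear layer."""
    logits = x @ weights
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_student(teacher_w, unlabeled_x, lr=0.5, steps=200):
    """Train a fresh student to match the teacher's pseudo-labels."""
    pseudo = predict(teacher_w, unlabeled_x).argmax(axis=1)  # hard pseudo-labels
    onehot = np.eye(teacher_w.shape[1])[pseudo]
    student_w = rng.normal(scale=0.01, size=teacher_w.shape)
    for _ in range(steps):
        p = predict(student_w, unlabeled_x)
        grad = unlabeled_x.T @ (p - onehot) / len(unlabeled_x)  # CE gradient
        student_w -= lr * grad
    return student_w

# Toy setup: a "pre-trained" teacher (fixed random weights here) and
# a pool of unlabeled feature frames.
teacher_w = rng.normal(size=(8, 3))
unlabeled_x = rng.normal(size=(256, 8))

# Self-training iterations: each student becomes the next teacher.
current_w = teacher_w
for _ in range(3):
    current_w = train_student(current_w, unlabeled_x)

agree = (predict(current_w, unlabeled_x).argmax(axis=1)
         == predict(teacher_w, unlabeled_x).argmax(axis=1)).mean()
print(f"final student agreement with original teacher: {agree:.2f}")
```

In the paper's setup the student additionally sees augmented input and a choice of consistency loss; this sketch only shows the pseudo-labeling skeleton that those variants build on.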