Paper Title

To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions

Authors

Ju-Chiang Wang, Yun-Ning Hung, Jordan B. L. Smith

Abstract


Conventional music structure analysis algorithms aim to divide a song into segments and to group them with abstract labels (e.g., 'A', 'B', and 'C'). However, explicitly identifying the function of each segment (e.g., 'verse' or 'chorus') is rarely attempted, but has many applications. We introduce a multi-task deep learning framework to model these structural semantic labels directly from audio by estimating "verseness," "chorusness," and so forth, as a function of time. We propose a 7-class taxonomy (i.e., intro, verse, chorus, bridge, outro, instrumental, and silence) and provide rules to consolidate annotations from four disparate datasets. We also propose to use a spectral-temporal Transformer-based model, called SpecTNT, which can be trained with an additional connectionist temporal localization (CTL) loss. In cross-dataset evaluations using four public datasets, we demonstrate the effectiveness of the SpecTNT model and CTL loss, and obtain strong results overall: the proposed system outperforms state-of-the-art chorus-detection and boundary-detection methods at detecting choruses and boundaries, respectively.
