Paper Title
Interpreting Song Lyrics with an Audio-Informed Pre-trained Language Model
Paper Authors
Paper Abstract
Lyric interpretations can help people understand songs and their lyrics quickly, and can also make it easier to manage, retrieve, and discover songs within the growing mass of music archives. In this paper, we propose BART-fusion, a novel model for generating lyric interpretations from lyrics and music audio that combines a large-scale pre-trained language model with an audio encoder. We employ a cross-modal attention module to incorporate the audio representation into the lyrics representation, helping the pre-trained language model understand the song from an audio perspective while preserving the language model's original generative performance. We also release the Song Interpretation Dataset, a new large-scale dataset for training and evaluating our model. Experimental results show that the additional audio information helps our model understand words and music better, and generate precise and fluent interpretations. An additional experiment on cross-modal music retrieval shows that interpretations generated by BART-fusion can also help people retrieve music more accurately than with the original BART.
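The abstract's central mechanism is a cross-modal attention module in which lyric token representations attend over audio frame representations, and the resulting audio context is fused back into the lyrics representation. The following is a minimal NumPy sketch of that general idea, not the paper's exact formulation: the shapes, the single-head attention, the projection matrices `Wq`, `Wk`, `Wv`, and the residual-style fusion are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(lyric_h, audio_h, Wq, Wk, Wv):
    """Fuse audio information into lyric representations.

    lyric_h: (T_text, d)  lyric token representations (queries)
    audio_h: (T_audio, d) audio frame representations (keys/values)
    Returns fused lyric representations of shape (T_text, d).
    """
    Q = lyric_h @ Wq                        # queries from the lyrics
    K = audio_h @ Wk                        # keys from the audio
    V = audio_h @ Wv                        # values from the audio
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) # scaled dot-product scores
    attn = softmax(scores, axis=-1)         # each lyric token attends over audio frames
    audio_context = attn @ V                # (T_text, d) audio summary per token
    # Residual-style fusion: the original lyric features are preserved,
    # mirroring the goal of keeping the language model's generative ability.
    return lyric_h + audio_context

# Toy example with hypothetical sizes.
rng = np.random.default_rng(0)
d = 8
lyric_h = rng.standard_normal((5, d))    # 5 lyric tokens
audio_h = rng.standard_normal((12, d))   # 12 audio frames
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
fused = cross_modal_attention(lyric_h, audio_h, Wq, Wk, Wv)
print(fused.shape)  # → (5, 8)
```

In a full model, `fused` would replace the lyric hidden states fed into the pre-trained decoder, so generation is conditioned on both modalities.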