Paper Title
Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning
Paper Authors
Paper Abstract
Automated audio captioning (AAC) is a cross-modal task that generates natural language to describe the content of input audio. Most prior works extract single-modality acoustic features only and are therefore sub-optimal for the cross-modal decoding task. In this work, we propose a novel AAC system called CLIP-AAC to learn an interactive cross-modal representation from both acoustic and textual information. Specifically, the proposed CLIP-AAC introduces an audio-head and a text-head into the pre-trained encoder to extract audio-text information. Furthermore, we apply contrastive learning to narrow the domain gap by learning the correspondence between audio signals and their paired captions. Experimental results show that the proposed CLIP-AAC approach surpasses the best baseline by a significant margin on the Clotho dataset in terms of NLP evaluation metrics. An ablation study indicates that both the pre-trained model and contrastive learning contribute to the performance gain of the AAC model.
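The contrastive objective described in the abstract can be illustrated with a minimal sketch. The snippet below implements a CLIP-style symmetric InfoNCE loss between audio and caption embeddings in NumPy; the function and variable names, the embedding shapes, and the temperature value are illustrative assumptions, not the paper's actual implementation or architecture (the audio-head/text-head encoders are not reproduced here).

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a cosine-similarity matrix (InfoNCE).

    audio_emb, text_emb: (batch, dim) arrays where row i of each array is a
    matched audio/caption pair. Matched pairs lie on the diagonal of the
    similarity matrix and serve as positive targets; all other entries in the
    same row/column are negatives.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(len(logits))              # diagonal = positive pairs

    def xent(lg):
        # numerically stable log-softmax along rows, then pick the diagonal
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the audio->text and text->audio directions
    return 0.5 * (xent(logits) + xent(logits.T))

# toy usage: nearly matched pairs should give a loss close to zero
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 16))
text = audio + 0.01 * rng.normal(size=(4, 16))
loss = contrastive_loss(audio, text)
print(float(loss))
```

Minimizing this loss pulls each audio embedding toward its paired caption embedding and pushes it away from the other captions in the batch, which is one way to "narrow the domain gap" between the two modalities as the abstract describes.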