Paper Title
Automated Audio Captioning using Audio Event Clues
Paper Authors
Paper Abstract
Audio captioning is an important research area that aims to generate meaningful descriptions for audio clips. Most existing research extracts acoustic features of audio clips as input to encoder-decoder and transformer architectures, which produce captions in a sequence-to-sequence manner. Due to data insufficiency and the limited learning capacity of these architectures, additional information beyond acoustic features is needed to generate natural language sentences. To address these problems, an encoder-decoder architecture is proposed that learns from both acoustic features and extracted audio event labels as inputs. The proposed model is based on pre-trained acoustic features and audio event detection. Various experiments with different acoustic features, word embedding models, audio event label extraction methods, and implementation configurations show which combinations perform better on the audio captioning task. Results of extensive experiments on multiple datasets show that using audio event labels together with acoustic features improves captioning performance, and the proposed method either outperforms or achieves competitive results with state-of-the-art models.
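The fusion described in the abstract, feeding both acoustic features and detected audio event labels to the encoder, can be sketched roughly as follows. This is a minimal NumPy illustration under assumed details: the array shapes, the embedding of event labels, and fusion by concatenation are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: frame-level acoustic features for one clip
# (T time frames x F feature bins, e.g. log-mel) and a multi-hot
# vector of detected audio event labels (E possible events).
T, F, E, D = 100, 64, 10, 32
acoustic = rng.standard_normal((T, F))
event_labels = np.zeros(E)
event_labels[[2, 7]] = 1.0  # e.g. two events detected in the clip

# Assumed embedding matrix mapping each event label to a D-dim
# vector; the active events' embeddings are averaged into one
# clip-level event vector.
event_embedding = rng.standard_normal((E, D))
event_vec = event_embedding[event_labels > 0].mean(axis=0)

# Fuse: broadcast the clip-level event vector across time frames
# and concatenate it with the frame-level acoustic features, giving
# the encoder F + D features per frame.
encoder_input = np.concatenate(
    [acoustic, np.tile(event_vec, (T, 1))], axis=1
)
print(encoder_input.shape)  # (100, 96)
```

In a full model, `encoder_input` would feed an encoder-decoder (e.g. a transformer) trained to emit the caption token sequence; the concatenation here simply shows one plausible way the event-label cue can accompany the acoustic features.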