Paper Title
WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information
Paper Authors
Paper Abstract
Automated audio captioning (AAC) is a novel task, where a method takes as input an audio sample and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from the image captioning or machine translation fields. In this work we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding: two for extracting the local and temporal information, and one for merging the outputs of the previous two processes. To generate the caption, we employ the widely used Transformer decoder. We assess our method using the freely available splits of the Clotho dataset. Our results increase the previously reported highest SPIDEr score from 16.2 to 17.3.
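To make the described layout concrete, below is a minimal PyTorch sketch of a two-branch audio encoder whose outputs are merged and fed to a Transformer decoder. This is not the authors' implementation: the module names (TemporalBranch, TimeFrequencyBranch), layer choices (dilated 1D convolutions for the temporal branch, 2D convolutions for the time-frequency branch, a linear layer for merging), and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of a two-branch encoder + Transformer decoder for AAC.
# All names, layer designs, and sizes below are hypothetical, chosen only
# to illustrate the paper's high-level structure.
import torch
import torch.nn as nn


class TemporalBranch(nn.Module):
    """Assumed branch for temporal patterns: stacked dilated 1D convolutions."""

    def __init__(self, n_mels: int, channels: int):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, channels, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_mels) -> (batch, time, channels)
        return self.convs(x.transpose(1, 2)).transpose(1, 2)


class TimeFrequencyBranch(nn.Module):
    """Assumed branch for time-frequency patterns: 2D convolutions over the spectrogram."""

    def __init__(self, n_mels: int, channels: int):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(8, 1, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.proj = nn.Linear(n_mels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_mels) -> (batch, time, channels)
        h = self.convs(x.unsqueeze(1)).squeeze(1)
        return self.proj(h)


class WaveTransformerSketch(nn.Module):
    def __init__(self, n_mels=64, channels=128, vocab_size=5000, n_heads=4, n_layers=3):
        super().__init__()
        self.temporal = TemporalBranch(n_mels, channels)
        self.time_freq = TimeFrequencyBranch(n_mels, channels)
        # Third learnable process: merges the outputs of the two branches.
        self.merge = nn.Linear(2 * channels, channels)
        self.embed = nn.Embedding(vocab_size, channels)
        dec_layer = nn.TransformerDecoderLayer(
            d_model=channels, nhead=n_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)
        self.out = nn.Linear(channels, vocab_size)

    def forward(self, mel: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels); tokens: (batch, seq) word indices.
        enc = self.merge(torch.cat([self.temporal(mel), self.time_freq(mel)], dim=-1))
        seq = tokens.size(1)
        # Causal mask so each position only attends to earlier caption tokens.
        mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(tokens), enc, tgt_mask=mask)
        return self.out(h)  # (batch, seq, vocab_size) logits


if __name__ == "__main__":
    model = WaveTransformerSketch()
    logits = model(torch.randn(2, 100, 64), torch.randint(0, 5000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 12, 5000])
```

At inference time, a caption would be produced autoregressively by feeding the tokens generated so far back into the decoder; the merged encoder output serves as the decoder's memory throughout.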