Temporal and cross-modal attention for audio-visual zero-shot learning

Paper title

Temporal and cross-modal attention for audio-visual zero-shot learning

Authors

Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

Abstract

Audio-visual generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information in order to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual data in videos can be exploited to learn powerful representations that generalise to unseen classes at test time. We propose a multi-modal and Temporal Cross-attention Framework (TCAF) for audio-visual generalised zero-shot learning. Its inputs are temporally aligned audio and visual features that are obtained from pre-trained networks. Encouraging the framework to focus on cross-modal correspondence across time instead of self-attention within the modalities boosts the performance significantly. We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the UCF, VGGSound, and ActivityNet benchmarks for (generalised) zero-shot learning. Code for reproducing all results is available at https://github.com/ExplainableML/TCAF-GZSL.
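
The cross-modal attention idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it assumes plain scaled dot-product attention with no learned projections, no multiple heads, and no positional embeddings. The function names `softmax` and `cross_attend` and the toy feature values are hypothetical and chosen only for illustration. Each time step of one modality attends only to the time steps of the other modality, never to its own.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Scaled dot-product attention across modalities (hypothetical helper).

    queries: time steps of one modality, a list of feature vectors.
    context: time steps of the other modality, same feature dimension.
    Every query attends only to `context`, so there is no self-attention
    within a modality.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        # Similarity between this time step and every time step of the other modality.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        # Weighted average of the other modality's features.
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return attended

# Toy, made-up features: 3 temporally aligned time steps, 4 dimensions each.
audio = [[0.1, 0.3, 0.0, 0.5], [0.2, 0.1, 0.4, 0.0], [0.0, 0.6, 0.1, 0.2]]
visual = [[0.4, 0.0, 0.2, 0.1], [0.1, 0.5, 0.3, 0.0], [0.3, 0.2, 0.0, 0.6]]

audio_from_visual = cross_attend(audio, visual)   # audio queries, visual keys/values
visual_from_audio = cross_attend(visual, audio)   # visual queries, audio keys/values
```

Because the attention weights are non-negative and sum to one, every output vector is a convex combination of the other modality's time-step features. A trained model would additionally learn query, key, and value projections.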
