Paper Title
Audio-Adaptive Activity Recognition Across Video Domains
Paper Authors
Paper Abstract
This paper strives for activity recognition under domain shift, for example caused by a change of scenery or camera viewpoint. The leading approaches reduce the shift in activity appearance by adversarial training and self-supervised learning. Different from these vision-focused works, we leverage activity sounds for domain adaptation, as they have less variance across domains and can reliably indicate which activities are not happening. We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation and address shifts in the semantic distribution. To further eliminate domain-specific features and include domain-invariant activity sounds for recognition, an audio-infused recognizer is proposed, which effectively models the cross-modal interaction across domains. We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically. Experiments on this dataset, EPIC-Kitchens, and CharadesEgo show the effectiveness of our approach.
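The abstract does not spell out how the "cross-modal interaction" between audio and visual features is modeled. As one generic illustration (not the authors' architecture; all function names, shapes, and the residual-fusion choice are our own assumptions), a common way to let visual features attend over audio features is scaled dot-product cross-attention:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, audio):
    """Sketch of cross-modal fusion: visual tokens (queries) attend
    over audio tokens (keys/values), then fuse residually.
    visual: (T_v, d), audio: (T_a, d) -> fused: (T_v, d)."""
    scale = np.sqrt(visual.shape[-1])
    attn = softmax(visual @ audio.T / scale, axis=-1)  # (T_v, T_a)
    return visual + attn @ audio  # residual fusion

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 8))   # 4 visual tokens, dim 8
a = rng.standard_normal((6, 8))   # 6 audio tokens, dim 8
fused = cross_modal_attention(v, a)
print(fused.shape)  # (4, 8)
```

Because the fused representation is conditioned on audio, sounds that are stable across domains can reweight visual evidence, which is the intuition the abstract appeals to.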