AV-Gaze: A Study on the Effectiveness of Audio Guided Visual Attention Estimation for Non-Profilic Faces

Paper title

AV-Gaze: A Study on the Effectiveness of Audio Guided Visual Attention Estimation for Non-Profilic Faces

Authors

Shreya Ghosh, Abhinav Dhall, Munawar Hayat, Jarrod Knibbe

Abstract

In challenging real-life conditions such as extreme head pose, occlusion, and low-resolution images, where visual information fails to estimate visual attention/gaze direction, audio signals could provide important and complementary information. In this paper, we explore whether audio-guided coarse head pose can further enhance visual attention estimation performance for non-prolific faces. Since it is difficult to annotate audio signals for estimating the head pose of the speaker, we use off-the-shelf state-of-the-art models to facilitate cross-modal weak supervision. During the training phase, the framework learns complementary information from synchronized audio-visual modalities. Our model can utilize any of the available modalities, i.e., audio, visual, or audio-visual, for task-specific inference. Notably, when AV-Gaze is tested on benchmark datasets with these specific modalities, it achieves competitive results on multiple datasets while being highly adaptive to challenging scenarios.
