Paper Title

Dual Domain-Adversarial Learning for Audio-Visual Saliency Prediction

Authors

Yingzi Fan, Longfei Han, Yue Zhang, Lechao Cheng, Chen Xia, Di Hu

Abstract

Both visual and auditory information are valuable for determining the salient regions in videos. Deep convolutional neural networks (CNNs) have shown strong capability on the audio-visual saliency prediction task. Due to various factors such as shooting scenes and weather, there often exists a moderate distribution discrepancy between the source training data and the target testing data. This domain discrepancy leads to degraded performance of CNN models on the target testing data. This paper makes an early attempt to tackle the unsupervised domain adaptation problem for audio-visual saliency prediction. We propose a dual domain-adversarial learning algorithm to mitigate the domain discrepancy between source and target data. First, a dedicated domain discrimination branch is built to align the auditory feature distributions. The auditory features are then fused into the visual features through a cross-modal self-attention module. A second domain discrimination branch is devised to reduce the domain discrepancy of the visual features and of the audio-visual correlations implied by the fused audio-visual features. Experiments on public benchmarks demonstrate that our method alleviates the performance degradation caused by the domain discrepancy.
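To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of its two named ingredients: domain discrimination branches trained through gradient reversal, and a cross-modal attention step that fuses auditory features into visual ones. This is a sketch under assumptions, not the paper's implementation: the class names (GradReverse, DomainDiscriminator, CrossModalAttentionFusion), feature shapes, and pooling are illustrative, and the paper's "cross-modal self-attention module" may differ from the plain cross-attention used here.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity on the forward pass, negated (scaled)
    gradient on the backward pass -- the standard trick behind
    domain-adversarial training (Ganin & Lempitsky, 2015)."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)


class DomainDiscriminator(nn.Module):
    """Binary classifier predicting source vs. target domain from a pooled
    feature vector; gradient reversal pushes the feature extractor toward
    domain-invariant features."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.ReLU(inplace=True),
            nn.Linear(dim // 2, 1),  # one logit: source (0) vs. target (1)
        )

    def forward(self, feat, lambd=1.0):
        return self.net(grad_reverse(feat, lambd))


class CrossModalAttentionFusion(nn.Module):
    """Fuses auditory features into visual features with multi-head attention
    (visual tokens as queries, auditory tokens as keys/values), followed by
    a residual connection and layer norm."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, aud):
        fused, _ = self.attn(query=vis, key=aud, value=aud)
        return self.norm(vis + fused)


if __name__ == "__main__":
    B, T, D = 2, 16, 128  # batch, token count, feature dim (all assumed)
    vis = torch.randn(B, T, D)  # visual features from some video encoder
    aud = torch.randn(B, T, D)  # auditory features from some audio encoder

    fusion = CrossModalAttentionFusion(D)
    aud_branch = DomainDiscriminator(D)    # branch 1: aligns auditory features
    fused_branch = DomainDiscriminator(D)  # branch 2: aligns fused features

    fused = fusion(vis, aud)
    # Pool over tokens, then score each clip; in training both branches would
    # see source and target batches and be optimized with a BCE domain loss,
    # while the saliency loss is computed on labeled source data only.
    aud_logits = aud_branch(aud.mean(dim=1))
    fused_logits = fused_branch(fused.mean(dim=1))
    print(aud_logits.shape, fused_logits.shape)  # torch.Size([2, 1]) each
```

Because the reversed gradient rewards features the discriminators cannot classify, the first branch aligns the auditory distributions across domains and the second aligns the fused audio-visual representation, which is the dual alignment the abstract describes.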
