Paper Title
STAViS: Spatio-Temporal AudioVisual Saliency Network
Paper Authors
Paper Abstract
We introduce STAViS, a spatio-temporal audiovisual saliency network that combines spatio-temporal visual and auditory information to efficiently address the problem of saliency estimation in videos. Our approach employs a single network that combines visual saliency and auditory features, learning to appropriately localize sound sources and to fuse the two saliencies into a final saliency map. The network has been designed, trained end-to-end, and evaluated on six different databases that contain audiovisual eye-tracking data for a large variety of videos. We compare our method against eight different state-of-the-art visual saliency models. Evaluation results across databases indicate that our STAViS model outperforms both our visual-only variant and the other state-of-the-art models in the majority of cases. Moreover, the consistently good performance it achieves across all databases indicates that it is well suited to estimating saliency "in the wild". The code is available at https://github.com/atsiami/STAViS.
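To make the fusion idea in the abstract concrete, the following is a minimal PyTorch sketch of one way to localize sound sources and fuse an audio-derived saliency map with a visual one. It is an illustration under stated assumptions, not the authors' actual architecture: the module name AudioVisualFusion, the layer choices, and the tensor shapes are all hypothetical, and the cosine-correlation localization step is a simplified stand-in for the paper's learned sound-source localization.

```python
# Hypothetical sketch of audiovisual saliency fusion; NOT the STAViS architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualFusion(nn.Module):
    def __init__(self, vis_channels=256, aud_dim=128):
        super().__init__()
        # Project clip-level audio features into the visual feature space
        # so the two modalities can be compared location by location.
        self.audio_proj = nn.Linear(aud_dim, vis_channels)
        # Learned fusion of the visual and audio saliency maps.
        self.fuse = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, vis_feats, vis_saliency, aud_feats):
        # vis_feats:    (B, C, H, W) spatio-temporal visual features
        # vis_saliency: (B, 1, H, W) visual-only saliency map
        # aud_feats:    (B, D) clip-level auditory features
        a = self.audio_proj(aud_feats)                    # (B, C)
        # Cosine correlation between the audio embedding and each spatial
        # location: a crude stand-in for sound-source localization.
        v = F.normalize(vis_feats, dim=1)
        a = F.normalize(a, dim=1)[:, :, None, None]       # (B, C, 1, 1)
        aud_saliency = (v * a).sum(dim=1, keepdim=True)   # (B, 1, H, W)
        # Fuse the two saliencies into the final map.
        fused = self.fuse(torch.cat([vis_saliency, aud_saliency], dim=1))
        return torch.sigmoid(fused)

# Usage on dummy tensors:
model = AudioVisualFusion()
vis_feats = torch.randn(2, 256, 28, 28)
vis_sal = torch.rand(2, 1, 28, 28)
aud_feats = torch.randn(2, 128)
print(model(vis_feats, vis_sal, aud_feats).shape)  # torch.Size([2, 1, 28, 28])
```

In this sketch the fusion weights are learned jointly with the localization projection, which mirrors the abstract's point that a single network learns both to localize sound sources and to combine the two saliencies; the paper itself should be consulted for the actual losses and layers used.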