Paper Title
VRAG: Region Attention Graphs for Content-Based Video Retrieval
Paper Authors
Paper Abstract
Content-Based Video Retrieval (CBVR) is used on media-sharing platforms for applications such as video recommendation and filtering. To manage databases that scale to billions of videos, video-level approaches that use fixed-size embeddings are preferred for their efficiency. In this paper, we introduce Video Region Attention Graph Networks (VRAG), which improve the state of the art of video-level methods. We represent videos at a finer granularity via region-level features and encode video spatio-temporal dynamics through region-level relations. Our VRAG captures the relationships between regions based on their semantic content via self-attention and the permutation-invariant aggregation of graph convolution. In addition, we show that the performance gap between video-level and frame-level methods can be reduced by segmenting videos into shots and using shot embeddings for video retrieval. We evaluate VRAG on several video retrieval tasks and achieve a new state of the art for video-level retrieval. Furthermore, our shot-level VRAG yields higher retrieval precision than other existing video-level methods and approaches the performance of frame-level methods at faster evaluation speeds. Finally, our code will be made publicly available.
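To make the mechanism described in the abstract concrete, the following is a minimal sketch of relating region-level features via self-attention and aggregating them with a permutation-invariant, graph-convolution-style update into a fixed-size video embedding. It assumes PyTorch; the module name `RegionAttentionGraph`, the feature dimension, head count, and mean pooling are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch (assumed details, not the paper's code): region features
# are related by self-attention, updated with a graph-convolution-style map,
# and pooled into a fixed-size, length-independent video embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttentionGraph(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Self-attention relates every region to every other region
        # (across space and time) based on feature similarity.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Linear map applied after the attention-weighted neighborhood
        # aggregation; aggregation over regions is permutation invariant.
        self.gcn = nn.Linear(dim, dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, dim), num_regions = frames x regions/frame
        ctx, _ = self.attn(regions, regions, regions)  # relate regions
        nodes = F.relu(self.gcn(ctx)) + regions        # GCN-style update + residual
        # Mean pooling over regions gives a fixed-size video embedding,
        # independent of video length and region count.
        return F.normalize(nodes.mean(dim=1), dim=-1)

# Example: embed a video with 30 frames x 9 regions of 512-d features.
model = RegionAttentionGraph()
video_regions = torch.randn(1, 30 * 9, 512)
embedding = model(video_regions)  # shape: (1, 512)
```

For the shot-level variant, one reading of the abstract is that each video becomes a small set of shot embeddings and similarity is computed over shot pairs; the paper's exact matching rule is not stated here, so the max over pairwise cosine similarities below is only an assumption.

```python
# Hypothetical shot-level matching: embed each shot with the model above,
# then score a video pair by its best-matching shot pair (assumed rule).
query_shots = F.normalize(torch.randn(4, 512), dim=-1)  # 4 query shots
db_shots = F.normalize(torch.randn(6, 512), dim=-1)     # 6 database shots
score = (query_shots @ db_shots.t()).max().item()       # max cosine over pairs
```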