Title

Online Video Instance Segmentation via Robust Context Fusion

Authors

Xiang Li, Jinglu Wang, Xiaohao Xu, Bhiksha Raj, Yan Lu

Abstract

Video instance segmentation (VIS) aims at classifying, segmenting and tracking object instances in video sequences. Recent transformer-based neural networks have demonstrated their powerful capability of modeling spatio-temporal correlations for the VIS task. Relying on video- or clip-level input, however, they suffer from high latency and computational cost. We propose a robust context fusion network that tackles VIS in an online fashion, predicting instance segmentation frame by frame using only a few preceding frames. To efficiently acquire a precise and temporally consistent prediction for each frame, the key idea is to fuse effective and compact context from reference frames into the target frame. Considering the different effects of reference and target frames on the target prediction, we first summarize contextual features through importance-aware compression. A transformer encoder is adopted to fuse the compressed context. We then leverage an order-preserving instance embedding to convey identity-aware information and to match identities to the predicted instance masks. We demonstrate that our robust fusion network achieves the best performance among existing online VIS methods and even surpasses previously published clip-level methods on the YouTube-VIS 2019 and 2021 benchmarks. In addition, visual objects often have acoustic signatures that are naturally synchronized with them in audio-bearing video recordings. By leveraging the flexibility of our context fusion network on multi-modal data, we further investigate the influence of audio on the dense video prediction task, which has not been discussed in existing works. We build an Audio-Visual Instance Segmentation dataset and demonstrate that acoustic signals in in-the-wild scenarios can benefit the VIS task.
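The two-stage idea in the abstract (importance-aware compression of reference-frame context, then attention-based fusion into the target frame) can be illustrated with a minimal, dependency-free sketch. This is a hypothetical simplification, not the paper's implementation: importance scores are assumed to be given (in the paper they would be learned), compression is modeled as top-k selection, and the transformer encoder is reduced to a single scaled dot-product attention step.

```python
import math

def importance_aware_compress(features, scores, k):
    """Keep the k context vectors with the highest importance scores.

    A stand-in for the paper's importance-aware compression: `scores`
    are assumed precomputed (learned in the actual method).
    """
    ranked = sorted(zip(scores, features), key=lambda p: p[0], reverse=True)
    return [f for _, f in ranked[:k]]

def attention_fuse(target, context):
    """Fuse compressed context into a target feature vector.

    Single scaled dot-product attention step with a residual add,
    loosely mimicking one transformer-encoder layer.
    """
    d = len(target)
    # Attention logits: dot products between target (query) and context (keys).
    logits = [sum(t * c for t, c in zip(target, ctx)) / math.sqrt(d)
              for ctx in context]
    # Numerically stable softmax over the context vectors.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Residual: target plus the attention-weighted sum of context values.
    return [t + sum(w * ctx[i] for w, ctx in zip(weights, context))
            for i, t in enumerate(target)]
```

In the full model, this fusion would run per spatial location over transformer feature maps, and the order-preserving instance embedding would then assign consistent identities across frames.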
