Paper Title
Dual Temporal Memory Network for Efficient Video Object Segmentation
Paper Authors
Paper Abstract
Video Object Segmentation (VOS) is typically formulated in a semi-supervised setting. Given the ground-truth segmentation mask on the first frame, the task of VOS is to track and segment the single or multiple objects of interest in the remaining frames of the video at the pixel level. One of the fundamental challenges in VOS is how to make the most of temporal information to boost performance. We present an end-to-end network that stores short-term and long-term video sequence information preceding the current frame as temporal memories to address temporal modeling in VOS. Our network consists of two temporal sub-networks: a short-term memory sub-network and a long-term memory sub-network. The short-term memory sub-network models the fine-grained spatial-temporal interactions between local regions across neighboring frames in the video via a graph-based learning framework, which preserves the visual consistency of local regions over time. The long-term memory sub-network models the long-range evolution of objects via a Simplified-Gated Recurrent Unit (S-GRU), making the segmentation robust against occlusions and drift errors. In our experiments, we show that our proposed method achieves favorable and competitive performance on three frequently used VOS datasets, DAVIS 2016, DAVIS 2017, and YouTube-VOS, in terms of both speed and accuracy.
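The abstract does not give the S-GRU equations; as a rough illustration of what a "simplified" GRU for long-term memory might look like, the sketch below keeps only the update gate of a standard GRU and drops the reset gate. The function name, weight shapes, and the specific simplification are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def s_gru_step(h_prev, x, Wz, Uz, Wh, Uh):
    """One hypothetical simplified-GRU step: a standard GRU with the
    reset gate removed, blending the previous long-term memory h_prev
    with a candidate state driven by the current-frame feature x.
    All weight matrices have shape (d, d); h_prev and x are (1, d)."""
    z = sigmoid(x @ Wz + h_prev @ Uz)        # update gate in (0, 1)
    h_tilde = np.tanh(x @ Wh + h_prev @ Uh)  # candidate memory state
    return (1.0 - z) * h_prev + z * h_tilde  # gated blend of old and new
```

With all weights at zero the gate is 0.5 and the candidate is 0, so the step simply halves the previous memory; trained weights would instead learn when to retain the old memory (e.g. during occlusion) and when to overwrite it.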