Paper Title
Temporal Attribute-Appearance Learning Network for Video-based Person Re-Identification
Paper Authors
Paper Abstract
Video-based person re-identification aims to match a specific pedestrian in surveillance videos captured at different times and locations. Human attributes and appearance are complementary to each other, and both contribute to pedestrian matching. In this work, we propose a novel Temporal Attribute-Appearance Learning Network (TALNet) for video-based person re-identification. TALNet simultaneously exploits human attributes and appearance to learn comprehensive and effective pedestrian representations from videos. It explores hard visual attention and temporal-semantic context for attributes, as well as spatial-temporal dependencies among body parts for appearance, to boost their learning. Specifically, an attribute branch network is proposed with a spatial attention block and a temporal-semantic context block for learning robust attribute representations. The spatial attention block focuses the network on the regions within video frames that correspond to each attribute, while the temporal-semantic context block learns both the temporal context of each attribute across video frames and the semantic context among attributes within each video frame. The appearance branch network is designed to learn effective appearance representations from both the whole body and body parts, together with the spatial-temporal dependencies among them. TALNet leverages the complementarity between attribute and appearance representations and jointly optimizes them in a multi-task learning fashion. Moreover, we annotate ID-level attributes for each pedestrian in two commonly used video datasets. Extensive experiments on these datasets have verified the superiority of TALNet over state-of-the-art methods.
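To make the two-branch design described in the abstract concrete, the following is a minimal PyTorch-style sketch of such an architecture. The backbone choice (ResNet-50), the use of an LSTM for temporal context, horizontal stripes as body parts, and all dimensions and names are illustrative assumptions, not the authors' implementation; the spatial-temporal dependency modeling among parts and the exact loss formulation are omitted.

```python
# Hypothetical sketch of a two-branch attribute/appearance network for
# video-based person re-ID, loosely following the abstract. All design
# choices below are assumptions made for illustration only.
import torch
import torch.nn as nn
import torchvision.models as models


class AttributeBranch(nn.Module):
    """Per-attribute spatial attention + temporal context (assumed LSTM)."""
    def __init__(self, feat_dim=2048, num_attrs=10):
        super().__init__()
        # One 1x1-conv attention map per attribute (spatial attention block).
        self.attn = nn.Conv2d(feat_dim, num_attrs, kernel_size=1)
        # Temporal context across frames for each pooled attribute feature.
        self.temporal = nn.LSTM(feat_dim, feat_dim // 2,
                                batch_first=True, bidirectional=True)
        self.classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, 2) for _ in range(num_attrs)])

    def forward(self, feat_maps):                        # (B, T, C, H, W)
        B, T, C, H, W = feat_maps.shape
        x = feat_maps.flatten(0, 1)                      # (B*T, C, H, W)
        attn = torch.softmax(self.attn(x).flatten(2), dim=-1)       # (B*T, A, H*W)
        pooled = torch.einsum('bah,bch->bac', attn, x.flatten(2))   # (B*T, A, C)
        pooled = pooled.view(B, T, -1, C)
        logits = []
        for a, clf in enumerate(self.classifiers):
            ctx, _ = self.temporal(pooled[:, :, a])      # context over frames
            logits.append(clf(ctx.mean(dim=1)))          # average over time
        return torch.stack(logits, dim=1)                # (B, A, 2)


class AppearanceBranch(nn.Module):
    """Whole-body plus part features; parts taken as horizontal stripes (assumed)."""
    def __init__(self, feat_dim=2048, num_parts=4, num_ids=625):
        super().__init__()
        self.num_parts = num_parts
        self.id_head = nn.Linear(feat_dim * (1 + num_parts), num_ids)

    def forward(self, feat_maps):                        # (B, T, C, H, W)
        B, T, C, H, W = feat_maps.shape
        x = feat_maps.flatten(0, 1)
        whole = x.mean(dim=(2, 3))                       # global average pooling
        parts = [p.mean(dim=(2, 3)) for p in x.chunk(self.num_parts, dim=2)]
        feat = torch.cat([whole] + parts, dim=1).view(B, T, -1).mean(dim=1)
        return self.id_head(feat), feat


class TALNetSketch(nn.Module):
    """Joint attribute + identity (appearance) prediction for multi-task training."""
    def __init__(self, num_attrs=10, num_ids=625):
        super().__init__()
        backbone = models.resnet50(weights=None)         # assumed backbone
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.attr_branch = AttributeBranch(num_attrs=num_attrs)
        self.app_branch = AppearanceBranch(num_ids=num_ids)

    def forward(self, clips):                            # (B, T, 3, H, W)
        B, T = clips.shape[:2]
        fmaps = self.backbone(clips.flatten(0, 1))
        fmaps = fmaps.view(B, T, *fmaps.shape[1:])
        attr_logits = self.attr_branch(fmaps)
        id_logits, app_feat = self.app_branch(fmaps)
        return attr_logits, id_logits, app_feat
```

In a multi-task setup of this kind, the attribute logits would typically be trained with per-attribute cross-entropy and the identity logits with an ID classification loss (optionally plus a ranking loss), summed with weighting coefficients; those weights and loss choices are again assumptions rather than details given in the abstract.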