Paper Title
Dual-Triplet Metric Learning for Unsupervised Domain Adaptation in Video-Based Face Recognition
Paper Authors
Paper Abstract
The scalability and complexity of deep learning models remain key issues in many visual recognition applications, such as video surveillance, where fine-tuning with labeled image data from each new camera is required to reduce the domain shift between videos captured in the source domain, e.g., a laboratory setting, and the target domain, i.e., an operational environment. In many video surveillance applications, such as face recognition (FR) and person re-identification, a pair-wise matcher is used to assign a query image captured by a video camera to the corresponding reference images in a gallery. The different configurations and operational conditions of video cameras can introduce significant shifts in the pair-wise distance distributions, resulting in degraded recognition performance for new cameras. In this paper, a new deep domain adaptation (DA) method is proposed to adapt the CNN embedding of a Siamese network using unlabeled tracklets captured with a new video camera. To this end, a dual-triplet loss is introduced for metric learning, where two triplets are constructed using video data from a source camera and a new target camera. To construct the dual triplets, a mutual-supervised learning approach is introduced in which the source camera acts as a teacher, providing the target camera with an initial embedding. The student then relies on the teacher to iteratively label the positive and negative pairs collected during, e.g., initial camera calibration. Both source and target embeddings continue to learn simultaneously so that their pair-wise distance distributions become aligned. For validation, the proposed metric learning technique is used to train deep Siamese networks under different training scenarios, and is compared to state-of-the-art techniques for still-to-video FR on the COX-S2V dataset and a private video-based FR dataset.
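The abstract does not give the loss in closed form, so the following is only a minimal sketch of the general idea, assuming the standard margin-based triplet loss applied once to a source triplet and once to a target triplet, with the two terms summed. The function names, the `margin` and `target_weight` parameters, and the distance-threshold rule used for pseudo-labeling are all hypothetical choices made for illustration; the paper's actual formulation may differ.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard margin-based triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def dual_triplet_loss(source_triplet, target_triplet, margin=0.2, target_weight=1.0):
    """Assumed combination: source triplet loss plus a weighted target triplet loss.

    Each triplet is an (anchor, positive, negative) tuple of embeddings.
    The weighting scheme is an illustrative assumption, not taken from the paper.
    """
    source_term = triplet_loss(*source_triplet, margin=margin)
    target_term = triplet_loss(*target_triplet, margin=margin)
    return source_term + target_weight * target_term


def pseudo_label_pairs(reference, tracklet_embeddings, threshold):
    """Assumed teacher-style labeling of unlabeled target embeddings.

    A target embedding is marked positive (True) when its distance to the
    reference embedding falls below `threshold`, and negative (False) otherwise.
    """
    return [euclidean(reference, e) < threshold for e in tracklet_embeddings]


# Toy usage with 2-D embeddings: the source triplet is already well separated,
# while the target triplet violates the margin and therefore contributes a loss.
source = ([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
target = ([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
loss = dual_triplet_loss(source, target)
labels = pseudo_label_pairs([0.0, 0.0], [[0.0, 1.0], [3.0, 4.0]], threshold=2.0)
```

In this toy case only the target term is non-zero, which mirrors the situation the abstract describes: an embedding that separates identities well on the source camera can still mis-rank pairs from a new camera until it is adapted.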