Paper Title
Unique Faces Recognition in Videos
Paper Authors
Paper Abstract
This paper tackles face recognition in videos using metric learning methods and similarity ranking models. It compares Siamese networks trained with contrastive loss against triplet networks trained with triplet loss, each implemented with the following architectures: the Google/Inception architecture, a 3-D Convolutional Network (C3D), and a 2-D Long Short-Term Memory (LSTM) recurrent neural network. We use both still images and sequences from videos to train the networks and compare the performance of these architectures. The dataset used is the YouTube Faces Database, designed for investigating the problem of face recognition in videos. The contribution of this paper is two-fold. First, the experiments establish that 3-D Convolutional Networks and 2-D LSTMs trained with contrastive loss on image sequences do not outperform the Google/Inception architecture trained with contrastive loss on still images in top-$n$ rank face retrieval; however, the 3-D Convolutional Network and 2-D LSTM with triplet loss do outperform the Google/Inception architecture with triplet loss in top-$n$ rank face retrieval on this dataset. Second, a Support Vector Machine (SVM) was used in conjunction with the CNNs' learned feature representations for face identification. The results show that feature representations learned with triplet loss are significantly better for n-shot face identification than those learned with contrastive loss, with the 2-D LSTM trained with triplet loss yielding the most useful representations. The experiments show that learning spatio-temporal features from video sequences is beneficial for face recognition in videos.
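The two losses compared in the abstract have standard formulations: contrastive loss operates on pairs of embeddings with a same/different label, while triplet loss operates on an (anchor, positive, negative) triple. The following is a minimal NumPy sketch of these standard definitions; the function names, margin values, and toy embeddings are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Standard contrastive loss: pull same-identity pairs together,
    push different-identity pairs at least `margin` apart."""
    d = np.linalg.norm(emb_a - emb_b)
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: the anchor should be closer to the
    positive (same face) than to the negative by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: a matched pair and a well-separated impostor.
a, p, n = np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([1.0, 0.0])
print(contrastive_loss(a, p, same=True))    # small: pair is close
print(triplet_loss(a, p, n))                # zero: margin satisfied
```

In both cases the network is trained so that Euclidean distance in the embedding space reflects identity, which is what makes the learned representations reusable for the SVM-based n-shot identification described above.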