# Face Recognition for Mass Events Photographs

# Middle delivery

## Team

1. Vlad Kuleykin (v.kuleykin@innopolis.ru)
2. Boris Guryev (b.guryev@innopolis.ru)
3. Ivan Lyagaev (i.lyagaev@innopolis.ru)

## First delivery plan (from the proposal)

Delivery of a working prototype system.

1. Gather the test *artificial* dataset **(All)**
2. Face detection **(Vlad)**
    - Locating faces on the image
3. Face extraction **(Ivan)**
    - Extracting face features
4. Face recognition **(Boris)**
    - Comparison of known and unknown face feature vectors

## Dataset

Initially, we planned to take several photos of ourselves at the university, or to use our existing personal photos, and to try model inference on them. During development, we found that for our purposes it is already possible to collect a real dataset from past Innopolis events. For the final delivery, we plan to build a dataset with corner cases such as lighting variation, significantly rotated faces, and partially occluded faces.

For now, we have collected:

* 2 current photos of ourselves for initial registration
* A labeled dataset with the labels: Boris, Ivan, Vlad, No one

| Event | Boris | Ivan | Vlad | No one |
| ----- | ----- | ---- | ---- | ------ |
| Club Fest (2018) | 9 | 0 | 12 | 256 |
| Halloween (2018) | 0 | 0 | 0 | 587 |
| Halloween (2017) | 6 | 0 | 6 | 453 |
| Slippers of the year (2018) | 1 | 1 | 0 | 130 |
| Slippers of the year (2019) | 7 | 3 | 0 | 430 |
| Aerotube (2019) | 17 | 0 | 0 | 81 |
| 23/8 (2017) | 5 | 0 | 4 | 225 |

#### Our photos:

![](https://i.imgur.com/6XdVFKq.jpg)

## Face detection

For this and the following tasks, we decided to use pre-trained models
due to time and resource limitations: state-of-the-art models require training on datasets of significant size for several epochs, and reusing pre-trained models is common practice even in industry projects.

For face detection, we selected the [MTCNN](https://github.com/ipazc/mtcnn) face detector [[2]](https://arxiv.org/ftp/arxiv/papers/1604/1604.02878.pdf). This network filters bounding boxes in three stages, named the *Proposal Network (P-Net)*, the *Refinement Network (R-Net)*, and the *Output Network (O-Net)*. First, the initial image is rescaled to different sizes to build an image pyramid. Next, candidate windows are produced by the fast Proposal Network (P-Net). These candidates are then refined by the Refinement Network (R-Net). In the third stage, the Output Network (O-Net) produces the final bounding boxes and facial landmark positions.

![](https://i.imgur.com/M28wvpI.png)

During forward propagation, the output of each stage is passed through non-maximum suppression and resizing.

![](https://i.imgur.com/fAj8cAm.png)

This pipeline does not have a single trivial loss function, because three different models are trained. The authors use a combined loss with penalties for face detection, bounding-box regression, and landmark prediction. Since each CNN is delegated a different task, the three parts of the loss are weighted with different coefficients. The coefficients depend, first of all, on the type of the input image (if an image without a face is fed, only the detection part of the loss is penalized) and, secondly, on the position of the CNN in the pipeline (P-Net and R-Net get lower penalties for wrongly proposed boxes and landmarks).

## Feature extraction

For feature extraction, we decided to use FaceNet [[1]](https://arxiv.org/pdf/1503.03832.pdf) embeddings of face images.
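FaceNet trains these embeddings with a triplet loss, explained in more detail below. As a minimal numpy sketch (the margin value and the tiny 2-D vectors are illustrative assumptions, not the paper's training setup):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss over squared L2 distances.

    anchor and positive share an identity; anchor and negative do not.
    The margin value here is illustrative, not the paper's exact setting.
    """
    d_ap = np.sum((anchor - positive) ** 2)  # squared distance to same identity
    d_an = np.sum((anchor - negative) ** 2)  # squared distance to other identity
    return max(0.0, d_ap - d_an + margin)
```

When the positive already sits much closer to the anchor than the negative does, the hinge clips the loss to zero, so training only pushes on triplets that still violate the margin.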
FaceNet uses a deep convolutional network to learn a vector space in which distances correspond to a measure of face similarity. Previous approaches treated the problem as a multi-class classification task: the network was trained with a classification layer at the end, and this layer was later dropped to obtain the embedding vectors. Those approaches relied on the assumption that the training dataset contains enough identities for the network to generalize to all identities in the world. FaceNet questions these methods and proposes to learn the embedding directly, without a classification layer, using a triplet loss. This loss is defined over three points, an anchor, a positive, and a negative, where the anchor and the positive share the same identity (correspond to the same face) and the anchor and the negative have different identities (correspond to different faces). The goal of the triplet loss is to minimize the distance between the anchor and the positive while maximizing the distance between the anchor and the negative.

## Face recognition

For now, we use the L2 norm to measure the similarity between two feature vectors. We chose L2 because it is the metric used in the paper of the network we use for feature extraction [[1]](https://arxiv.org/pdf/1503.03832.pdf). In the future, we will decide whether to change the scoring metric; for that, we need to consider the precision/recall requirements, which we plan to do in the final delivery.

Another open question is how to find the optimal threshold. There are two options:

* Trivial: one threshold for everybody in the application. We saw the drawbacks of this approach during experiments: in some cases the algorithm produces similar feature vectors for two different persons, so a single threshold does not fit both of them.
* Variable: using feedback from the user, the algorithm adapts and recalculates a separate threshold for each user.
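The trivial (global-threshold) variant of this comparison can be sketched as follows. The `identify` helper, the gallery layout, and the threshold value are our illustrative assumptions, not a fixed part of the pipeline:

```python
import numpy as np

def identify(query, gallery, threshold=1.1):
    """Match a query embedding against registered ones by L2 distance.

    gallery maps a person's name to their registered embedding.
    threshold is a hypothetical global value (the 'trivial' option above);
    a query farther than it from every registered face is labeled "No one".
    """
    best_name, best_dist = "No one", float("inf")
    for name, emb in gallery.items():
        dist = np.linalg.norm(query - emb)  # L2 distance between embeddings
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else "No one"
```

With the variable-threshold option, the single `threshold` argument would become a per-user value stored alongside each registered embedding and updated from user feedback.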
## Discussion

#### Results

As mentioned above, we applied recognition to all photos from the events. MTCNN and FaceNet succeeded in detecting all the faces we labeled, and even detected faces in photos that we had considered insignificant during labeling due to occlusion or small size (corner cases). One of these photos had been given the "No one" label, but the correct face was still detected.

An example of face detection on a photo with occluded faces:

![](https://i.imgur.com/4dzCi6U.jpg)

The recognized face of Boris in this photo:

![](https://i.imgur.com/3Nl4xFC.jpg)

#### Future work

For now, we store the produced feature vectors of detected faces in a binary file inside the folder of the corresponding event. This is not the best approach, and we plan to improve it in the future. We decided to make this project useful for the Innopolis University community, so we plan to deploy the results as a service that will be able to consume photos from ongoing events and register new users.

During a discussion of the work with our TA Alex, he recommended paying attention to the face recognition library ["face.evoLVe"](https://github.com/ZhaoJ9014/face.evoLVe.PyTorch). This library provides a comprehensive face recognition pipeline (which also includes a face detection part), including tools for training, fine-tuning, and optimization. For the detection part, the library contains a modification of MTCNN. We will consider replacing our current models with one of those proposed in the library.

## Second delivery

Delivery of a fully working system.

1. ~~Gather the *real* dataset **(Ivan)**~~
2. Gather the *corner cases* dataset **(Ivan)**
3. Complete pipeline **(All)**
    - Discuss/modify feature storage
    - Person registration functionality
    - Interface for adding photos from a new event
4. Testing and collecting metrics **(Vlad)**
    - Precision and recall of the complete pipeline
5. Starting an open-source project **(All)**
    - Publishing the work on GitHub
6.
Deploy the pipeline as a service **(Boris)**

## References

[[1] FaceNet: A Unified Embedding for Face Recognition and Clustering. Florian Schroff, Dmitry Kalenichenko, James Philbin. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815-823](https://arxiv.org/pdf/1503.03832.pdf)

[[2] Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. K. Zhang, Z. Zhang, Z. Li, Y. Qiao. IEEE Signal Processing Letters, 2016](https://arxiv.org/ftp/arxiv/papers/1604/1604.02878.pdf)

[[3] face.evoLVe: High-Performance Face Recognition Library based on PyTorch](https://github.com/ZhaoJ9014/face.evoLVe.PyTorch)