# Few Shot Facial Recognition 
<p align="center">
<img src="https://i.imgur.com/mcHtn5R.png">
</p>
<!-- ## Table of Contents -->
<!-- [TOC] -->
## What is this?
Using a pre-trained FaceNet model, we try to solve the facial recognition
problem with only 5 images per person.
The need for capable facial recognition systems is
increasing every day, as new applications for the technology
are found - from security and surveillance to self-driving cars.
One unique requirement for facial recognition that sets it apart from other AI problems is the relatively small datasets it is
expected to work with; anyone who needs a facial recognition
system to recognize them would not be willing to provide
more than a few images. With this, the computer is expected to
learn how to accurately identify that person under a wide
range of lighting conditions, facial expressions, hairstyles, etc.
That type of machine learning - where a model uses a small
number of examples to cover a wide range of inputs - is
referred to as few-shot learning, and was the focus of this
project.

In short, a dataset of labeled face images was used, with
only a few training images per label. Image augmentation was
used to effectively create more training data, and several
models were created and trained on it. The models were then
used to make predictions on the test data, and various
performance metrics were recorded. Finally, the models were
refined and experimented with to achieve the best results. This
process is elaborated on further below.
Dataset
---
To achieve this, the publicly available CelebA dataset was used.

Images of 253 different people were included.
For each person, 5 images were picked at random for the training dataset, and 3 were picked for the test dataset.
Executable Files: `prepData.js`
> The dataset as downloaded from source has too many images per person and does not have a structure suitable for our use.
>
> This file is a Node.js script that picks out a few train/test images per person, and arranges them in a folder hierarchy that makes the dataset easier to use.
>
> Note that after running this code, the dataset still contains images for 10,000 people. For practical reasons, we manually reduced this down to 253 people by deleting the folders for all the others. The first 253 folders in numerical order (by name) were retained. This does not have any code associated with it as it was done manually.
>
> All the following Python code assumes the dataset has been run through this script.

Aside from the train and test dataset, an ‘other’ dataset was also prepared. This contained images of people that did not exist in the training dataset, and so could never be identified by any trained model. The purpose of this dataset was to ensure that the models could correctly predict when an image had an unknown face in it, rather than reporting a false positive for any existing label. This is important since
identifying strangers is a critical feature for any security
system that uses facial recognition. During training, this
‘other’ set was to be appended to the eval set and used to tune
the models.
Finally, the images in all datasets needed to be turned into
a standardized numerical format that the models could work
with.
The original train dataset was then split into train and
evaluation (‘eval’) sets, at a 70/30 ratio. The eval set was
meant to be used to evaluate model performance during
training. At this point, the dataset was ready to be used. It had the
train, test, and eval (+ ‘other’) subsets with 7023, 900, and
3724 images respectively, and consisted of face embeddings
for the faces in the original images.
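
A minimal sketch of that 70/30 split, assuming the embeddings and labels are already loaded as arrays (variable names here are illustrative; the actual split is performed in `few_shot_face_recogntion.py`):

```python
from sklearn.model_selection import train_test_split

# X: (n_samples, 128) face embeddings, y: corresponding person labels
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
```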
Image Augmentation
---
Image augmentation was performed on the training dataset,
where 9 new images were derived from each original image.
This involved manipulating the existing images in
various ways such as rotating, flipping, resizing, adjusting the
brightness, and adding noise. This helped increase the amount
of training data and also helped the models to better recognize
faces under a variety of conditions.

Executable Files: `augmentation.py`
> This file contains the code for image augmentation.

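
A minimal sketch of this kind of augmentation using Pillow and NumPy (the specific transforms and parameters are illustrative assumptions; the project's actual transforms live in `augmentation.py`):

```python
import numpy as np
from PIL import Image, ImageEnhance, ImageOps

def augment(img: Image.Image) -> list:
    """Derive several variants of a single training image."""
    variants = [
        ImageOps.mirror(img),                       # horizontal flip
        img.rotate(10), img.rotate(-10),            # small rotations
        img.resize((140, 140)).resize(img.size),    # down- then up-scale
        ImageEnhance.Brightness(img).enhance(1.3),  # brighter
        ImageEnhance.Brightness(img).enhance(0.7),  # darker
    ]
    # additive Gaussian noise
    arr = np.asarray(img).astype('float32')
    noisy = np.clip(arr + np.random.normal(0, 10, arr.shape), 0, 255)
    variants.append(Image.fromarray(noisy.astype('uint8')))
    return variants
```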
Multi-Task Cascaded Convolutional Networks (MTCNN)
---
First, the MTCNN facial extraction model was used to
identify the region of each image that contained a face.
MTCNN (Multi-Task Cascaded Convolutional Networks) is a
highly accurate face detection model that was developed in
2016 and is freely available. It uses three stages of
convolutional neural networks to perform its task. With it, the
face in each image could be cropped and then resized to 160
by 160 pixels.
<p align="center">
<img src="https://i.imgur.com/mcHtn5R.png">
</p>
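
A minimal sketch of this step, assuming the commonly used `mtcnn` pip package (the helper function and file handling here are illustrative):

```python
import numpy as np
from PIL import Image
from mtcnn import MTCNN

detector = MTCNN()

def extract_face(path, size=(160, 160)):
    """Detect the face in an image, crop it, and resize it to 160x160."""
    pixels = np.asarray(Image.open(path).convert('RGB'))
    results = detector.detect_faces(pixels)
    if not results:
        return None                      # no face found; image is skipped
    x, y, w, h = results[0]['box']
    x, y = max(x, 0), max(y, 0)          # the box can have negative coords
    face = pixels[y:y + h, x:x + w]
    return np.asarray(Image.fromarray(face).resize(size))
```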
FaceNet
---
FaceNet is a Google-developed ML system from 2015 that
“directly learns a mapping from face images to a compact
Euclidean space where distances directly correspond to a
measure of face similarity.” Put simply, the FaceNet model
was used to extract numerical feature mappings, called face
embeddings, from each image. A face embedding is a
128-element numerical array that captures information about
the unique or interesting features of the face. These can be
directly compared with each other or can be used to train
models and make predictions.
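
A minimal sketch of extracting an embedding with the `facenet_keras.h5` weights linked under Other Notes below (the pixel standardization shown is one common preprocessing choice, not necessarily the exact one used in the scripts):

```python
import numpy as np
from tensorflow.keras.models import load_model

model = load_model('facenet_keras.h5')   # expects 160x160x3 face crops

def get_embedding(face_pixels):
    """Return the 128-element face embedding for one cropped face."""
    face = face_pixels.astype('float32')
    face = (face - face.mean()) / face.std()           # standardize pixels
    return model.predict(np.expand_dims(face, 0))[0]   # shape: (128,)
```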
Results
---
Exploratory Data Analysis
---
After augmentation, out of 12,630 total examples, MTCNN
was able to detect faces in around 85% of them; the remaining
images were not used. Most of the face-detection failures
came from augmented images, which shows that the image
augmentation process used could be improved in the
future. The following is the error distribution.

To understand more about face embeddings, we decided to
find the centroid of each label. This was achieved by
averaging all the vectors of a person to find their spatial
center. This provides the opportunity to study which faces are
most similar and which are most different, by measuring the
distances between these centroids.
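
A minimal sketch of that centroid computation, assuming a hypothetical `embeddings_by_label` dict mapping each person to an array of their 128-d embeddings:

```python
import numpy as np
from itertools import combinations

# Spatial centre of each person's embeddings
centroids = {label: np.mean(embs, axis=0)
             for label, embs in embeddings_by_label.items()}

# Pairwise distances between centroids, sorted from most to least similar
pairs = sorted((np.linalg.norm(centroids[a] - centroids[b]), a, b)
               for a, b in combinations(centroids, 2))
most_similar, most_different = pairs[0], pairs[-1]
```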

According to the FaceNet model, these were the 2 most
similar people in our data set, with a distance of 5.22884 from
each other. The resemblance is notable by eye.

These were the 2 most different people, with a distance of
around 15. We might expect that these people would have
opposite genders and/or different skin colours, but these two
have the same gender and skin colour. This suggests that the
model focuses on facial features rather than broader visual
attributes.
A dimensionality reduction technique called t-SNE was employed
to visualize a sample of 100 labels from the training set:

It is clear that some clusters have formed, indicating that
different boundaries can be drawn for classification.
Interestingly some outliers are gathered in the middle; this
noise is possibly caused by the image augmentation.
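
A minimal sketch of that visualization, assuming `X` holds the sampled embeddings and `y` their integer labels (the actual plotting code in `analysis.py` may differ):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

coords = TSNE(n_components=2, random_state=0).fit_transform(X)  # (n, 128) -> (n, 2)
plt.scatter(coords[:, 0], coords[:, 1], c=y, cmap='tab20', s=8)
plt.title('t-SNE of face embeddings (100 labels)')
plt.show()
```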
Brute Force Euclidean Distance
---
This first system was the simplest possible approach. It
does not involve training. For prediction, the new ‘test’ face
embedding is simply compared to every single face
embedding from the training dataset by measuring the
Euclidean distances between them. The training image with
the shortest Euclidean distance from the test image is found,
and the person/label from that training image is identified as
the predicted identity for the test face. If the shortest distance
is still above a certain threshold, then the identity of the new
test face is declared to be unknown. This threshold is the only
parameter of this system; it was experimented with using the
eval set to find the optimal value.
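
A minimal sketch of this prediction rule (variable names are illustrative; the project's version is in `few_shot_face_recogntion.py`):

```python
import numpy as np

def predict_brute_force(test_emb, train_embs, train_labels, threshold=10.0):
    # Euclidean distance from the test embedding to every training embedding
    dists = np.linalg.norm(train_embs - test_emb, axis=1)
    best = np.argmin(dists)
    return train_labels[best] if dists[best] <= threshold else 'unknown'
```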
Naturally, the brute-force method of comparing every
training sample to the test sample means this system is very
slow. The next approach is an optimized version of this
Euclidean distance comparison and is much faster.
After repeated trial and error, the threshold value of 10 was
found to be optimal. With this threshold, the system had an
accuracy of about 84.77% on the eval dataset, and the time
taken to make all the predictions was over 530 seconds.
On the test dataset, an accuracy of about 71.11% was achieved
in 122.44 seconds.
It should be noted that an analysis of the training dataset
showed that the two labels that were closest to each other (the
two different people in the dataset who looked the most alike)
had a Euclidean distance from each other of only 5.13 on
average. It is likely that errors made by the system involved
these individuals or others that looked alike. The same applies
to the K-D tree model described next.
We consider the Brute Force model as a baseline. It achieved
good accuracy with zero training time, but a terribly high
classification time. Inspired by KNN, we decided to employ a
K-D tree to optimize this process.
K-D Tree With Euclidean Distance
---
A K-D tree is a data structure that is used to organize
points in a k-dimensional space. This makes it ideal to store
face embeddings, as they are 128-dimensional vectors. While
building a K-D tree takes time, that initial investment pays off
later as the structure can be navigated very efficiently. The
samples in the training dataset can be arranged in a K-D tree
according to their Euclidean distances relative to each other.
Then instead of comparing the Euclidean distance of the new
test sample with every single training sample (as in the
previous section), the K-D tree can be used to efficiently
search through the training set and drastically reduce the
number of comparisons needed. It can be thought of as
searching through a sorted binary tree, except in k dimensions
instead of one. Similar to the previous approach, the only
parameter to change is the threshold at which the prediction is
“unknown”.
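
A minimal sketch using SciPy's k-d tree (an assumption; the report's code may use a different implementation):

```python
import numpy as np
from scipy.spatial import cKDTree

tree = cKDTree(train_embs)                 # built once from the training embeddings

def predict_kdtree(test_emb, threshold=10.0):
    dist, idx = tree.query(test_emb, k=1)  # nearest-neighbour lookup
    return train_labels[idx] if dist <= threshold else 'unknown'
```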
As will be explored further below, the results from the
previous approach are identical to those from this method, but
the latter is several times faster.
Since this optimized approach is very fast, predictions were
made for the eval dataset using every threshold value between
1 and 20. As with the previous approach, the optimal threshold
was found to be 10, and the highest accuracy achieved on the
eval dataset was 84.77%. The time taken was only about 6.87
seconds.
On the test dataset, an accuracy of about 71.11% was achieved
in 2.32 seconds.
The graph of accuracy on the eval dataset for each threshold
between 1 and 20 is shown below.

Stacked SVM
---
Support Vector Machines are known for their ability to fit
high-dimensional data with limited examples, so an
SVM was a natural choice for this project. The previous
methods used a threshold value to decide when a sample was
from the ‘other’ dataset, meaning it did not exist in the
training set and was therefore unknown. To play a similar
role in this method, a second SVM was introduced to process
the output probability distribution of the first SVM. The
overall system structure is shown below:

The first model, SVM A, was trained to classify examples.
The second model, SVM B, was trained to study the
softmax-like probability distribution output of SVM A to
determine whether the data point is from the ‘other’ dataset. If
its answer is yes, the example is labeled unknown; if
not, it is passed back to SVM A for class determination.
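
A minimal sketch of the stacked arrangement with scikit-learn (variable names and kernels are illustrative assumptions; the actual training code is in `svm_fewshot.py`):

```python
import numpy as np
from sklearn.svm import SVC

# SVM A: multi-class classifier over the known identities
svm_a = SVC(kernel='linear', probability=True).fit(X_train, y_train)

# SVM B: binary classifier over SVM A's probability distributions,
# trained on the eval set + 'other' set to separate known from unknown faces
probs = svm_a.predict_proba(X_eval_plus_other)
svm_b = SVC(kernel='rbf').fit(probs, known_vs_unknown_labels)

def predict(embedding):
    p = svm_a.predict_proba(embedding.reshape(1, -1))
    if svm_b.predict(p)[0] == 'unknown':
        return 'unknown'
    return svm_a.classes_[np.argmax(p)]
```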

The above graph supports this hypothesis. For the
unseen labels, the values cluster densely around 5.3, whereas
the remaining samples are spread out at lower values.
Files
---
Below is a brief explanation of each file.
## `prepData.js`
The dataset as downloaded from source has too many images per person and does not have a structure suitable for our use.
This file is a Node.js script that picks out a few train/test images per person, and arranges them in a folder hierarchy that makes the dataset easier to use.
Note that after running this code, the dataset still contains images for 10,000 people. For practical reasons, we manually reduced this down to 253 people by deleting the folders for all the others. The first 253 folders in numerical order (by name) were retained. This does not have any code associated with it as it was done manually.
All the following Python code assumes the dataset has been run through this script.
## `few_shot_face_recogntion.py`
This file contains functions to load the data, split it into train/test/eval, extract face embeddings, and perform other steps (excluding image augmentation) as described in the report. It also contains functions to run the first two methods described in the report that involve Euclidean distances.
Finally, the code at the end of the file uses the aforementioned functions to run experiments and show results for the first 2 methods.
## `svm_fewshot.py`
This file contains (aside from some helper functions identical to those in the previous file) all the code related to the third "Stacked SVM" method from our report. Running this file will perform the training, testing, and will generate results.
## `analysis.py`
This file contains code that analyses the dataset in various ways as discussed in the report. This includes generating plots to visualize data samples, probability distributions, etc.
## Other Notes
The original dataset (and associated .txt files) can be downloaded from: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
The specific implementation of FaceNet we used (the `facenet_keras.h5` file) can be found here: https://drive.google.com/file/d/1PZ_6Zsy1Vb0s0JmjEmVd8FS99zoMCiN1/view?usp=sharing