---
title: "DLIM: Face detection"
date: 2021-11-22 14:00
categories: [Image S9, TVID]
tags: [Image, S9, TVID]
math: true
---
Link to the [HackMD note](https://hackmd.io/@lemasymasa/rksnLMt_F)
# Face detection in general
*Why is face detection so difficult?*
> Pose (*Out-of-Plane Rotation*) and orientation (*In-Plane Rotation*)
> Presence or absence of structural components
> Occlusions
> Imaging conditions
> Faces are highly non-rigid objects (*deformations*)
## Related problems
- Face localization
- Facial feature extraction (landmarks such as eyes, mouth, ...)
- Face recognition
- Verification
- Facial expression
# Overview of different approaches:
1. Knowledge-based top-down methods
2. Feature invariant methods (localization)
3. Template-matching methods (localization)
4. Appearance-based methods (detection)
## Appearance-based methods **in detail**
- Eigenfaces
- Distribution-based methods
- Support Vector Machines (SVM)
- Sparse Network of Winnows
- Naive Bayes Classifier
- Hidden Markov models
![](https://i.imgur.com/vy81DNW.png)
- Information Theoretic Approaches (ITA)
- Inductive Learning (C4.5 and Find-S algorithms)
- Artificial Neural Networks (ANN) techniques
- Shallow networks (as opposed to deep ones)
- Deep learning
![](https://i.imgur.com/Kzwwlwo.png)
Residual connections allow backpropagation to reach much deeper layers than in a plain network.
:::warning
The big drawback of VGGs: they have **an enormous number** of weights
:::
> weights $=$ parameters $\neq$ hyperparameters
![](https://i.imgur.com/0AF2tEV.png)
The proportion of neurons (in $\%$) decreases from layer to layer.
:::danger
The more weights we have, the more powerful the network can be, but also the more likely it is that part of it is useless.
:::
# The beginning in 1994
:::info
Burel and Carel propose a methodology for ANNs:
1. The training phase, where the system *tunes* its internal parameters
2. The local training phase, where the system adapts the weights specific to the environment of a local site
3. The detection phase, during which **the weights do not change**
:::
Vaillant, Monrocq and Le Cun: first translation-invariant ANN, which decides whether each pixel belongs to a given object or not
Yang and Huang: first fully automatic human face recognition system
## 1997
:::info
Rowley, Baluja and Kanade propose the first rotation-invariant method:
- Uses template-based approach
- Methodology:
- Regions are proposed
- A router network estimates the orientation of the region
- The window is then rotated back (derotated) according to this angle
- A detector network decides whether the window contains a face
![](https://i.imgur.com/rWhIs2L.png)
:::
## 2004
:::info
First real-time face detection algo by Viola & Jones
:::
- Tells if a given image of arbitrary size contains a human face, and if so, where it is
- Minimizes false positive and false negative rates
- Usually 5 types of Haar-like features
![](https://i.imgur.com/DuYfZWB.png)
- $24\times 24$ image contains **a huge number of features** ($162886$)
- Integral image for feature computation
![](https://i.imgur.com/vRsIdvf.png)
- $A=1$, $B=2$, $C=3$, etc.
:::success
Allows a low computational cost of features
:::
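The integral image idea can be sketched in a few lines of Python (a pedagogical sketch of the summed-area table, not the original implementation):

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum over any rectangle in O(1): at most 4 table lookups."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total
```

Each Haar-like feature is then just a handful of `rect_sum` calls, which is what makes evaluating so many features per window affordable.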
:::info
**Principle**
> The algorithm should deploy more resources to work on those windows more likely to contain
a face while spending as little effort as possible on the rest
:::
- We can use *weak classifiers*
- Then we can make a strong one from a sequence of weak ones
- Viola & Jones: use AdaBoost
![](https://i.imgur.com/g4sPVPF.png)
The more layers, the fewer false positives:
![](https://i.imgur.com/kqlK5x2.png)
![](https://i.imgur.com/OPqvcQy.png)
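The attentional-cascade principle can be sketched in plain Python (the stage classifiers below are placeholder functions, not actual trained Haar-feature stages):

```python
def cascade_detect(window, stages):
    """Attentional cascade: each stage is a (classifier, threshold) pair.
    A window is rejected as soon as one stage scores below its
    threshold, so most negative windows exit after very few tests."""
    for classifier, threshold in stages:
        if classifier(window) < threshold:
            return False  # early rejection: cheap for non-faces
    return True  # survived every stage: report a face

# Toy stages (placeholders for AdaBoost-trained stage classifiers)
stages = [(lambda w: sum(w), 3), (lambda w: max(w), 2)]
```

This is exactly the "more resources on promising windows" principle: the vast majority of windows are discarded by the first cheap stages.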
# Overfeat (2014)
:::info
- Winner of the ImageNet Large Scale Visual Recognition Challenge of 2013
- Performs classification (blocks), localization (grouping blocks) and detection (merging windows) at the same time
- This multitask approach boosts the performance of the network
- Trained on ImageNet 2012
:::
- Inspired by multi-view voting
- Uses multiscale factor of 1.4
- Uses dense sliding windows thanks to convolution
> The better aligned the network window and the object, the stronger the confidence of the network response.
- Efficiency: convolution computations in overlapping regions are *shared*
- Bounding boxes are accumulated instead of suppressed
- Only *one shared network for 3 functionalities*
- Uses a *feature extractor* for classification purpose
- Uses *offsets to refine the resolution of the proposed windows*
- Detection fine-tuning: negative training on the fly
## Methodology
- decomposition into blocks with 3 offsets
- for each block, estimation of the most probable corresponding class
- (overlapping) region proposals for each class (see below)
- bounding box deduction for each class (see below)
![](https://i.imgur.com/g6BS6RD.png)
![](https://i.imgur.com/q9SZ6Rv.png)
![](https://i.imgur.com/2Bdxt6e.png)
![](https://i.imgur.com/LbgeFtC.png)
![](https://i.imgur.com/3qVcyge.png)
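The "accumulate instead of suppress" idea can be illustrated with a toy greedy merge (a strong simplification of OverFeat's actual box-merging criterion; the IoU test and the 0.5 threshold here are stand-ins):

```python
def iou(a, b):
    """Intersection-over-Union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def merge_overlapping(boxes, thresh=0.5):
    """Repeatedly average any two boxes that overlap enough, instead
    of discarding the lower-scored one as NMS would."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if iou(boxes[i], boxes[j]) > thresh:
                    a, b = boxes[i], boxes[j]
                    avg = tuple((a[n] + b[n]) / 2 for n in range(4))
                    boxes = [boxes[m] for m in range(len(boxes))
                             if m not in (i, j)]
                    boxes.append(avg)
                    merged = True
                    break
            if merged:
                break
    return boxes
```

Merging rewards consensus: many overlapping proposals around a true object collapse into one confident box, while isolated false positives stay isolated.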
# The MTCNN face detection algorithm (2016)
## Zhang, Zhang & Li
- Real-time deep-learning-based face detection algorithm
- The MTCNN is a cascade of 3 similar networks (P/R/O-nets)
- The four steps:
1. Computation of the (multiscale) image pyramid
2. P-Net: proposal network
3. R-net: refinement net (filters and refines the results of the P-Net)
4. O-net: output network (still refines, and proposes landmarks)
![](https://i.imgur.com/gbmBdst.png)
- Uses **hard sample mining** (the $30\%$ easiest cases do not take part in the backpropagation) to improve the detection results
- Originality: uses *multi-task* learning, that is, every network
- predicts bounding boxes
- uses regression to refine/calibrate the position of the edges of the bounding box
- applies Non-Maximal Suppression (NMS) to keep only relevant candidate windows (merging highly overlapping candidates)
- (can) propose 5 facial landmarks
- This multi-task learning seems to improve face detection compared to usual mono-task learning
- How does it work in practice? It minimizes:
$$
Loss = \alpha_1\times L_{detection} + \alpha_2\times L_{regression} + \alpha_3\times L_{landmarks}
$$
where the first term is based on cross-entropy, and the others are based on Euclidean loss
![](https://i.imgur.com/ntCzVfd.png)
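The NMS step used between stages can be sketched in plain Python (a generic greedy NMS over axis-aligned boxes, not the exact MTCNN implementation):

```python
def iou(a, b):
    """Intersection-over-Union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop every remaining
    candidate that overlaps it by more than `thresh`, and repeat.
    Returns the kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```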
# Fast R-CNN and its predecessors (2014-2015)
## Spatial-Pyramid Pool network (2014)
- Have been proposed to speed up R-CNN by sharing computation,
- The SPPnet computes a shared feature map using convolutions over the entire image, and only then extracts the features corresponding to each proposal to make the prediction,
- Then it concatenates the features of the proposal coming from each scale thanks to MaxPooling into a $6 \times 6 \times$ scales map (spatial pyramid).
- SPP-net accelerates R-CNN by 10 to 100 times at test time and by 3 times at training time.
- Drawback 1: Like the R-CNN, it is a multi-stage approach:
- First, feature extraction using convolution,
- Second, fine-tuning of a network using log loss,
- Third, SVM training,
- Fourth, fitting bounding-box regressors.
- Drawback 2: Features are written to disk,
- The fine-tuning cannot update the convolutional layers that precede the spatial pyramid pooling (limited accuracy).
## Fast R-CNN (2015)
:::info
a Fast Region-based Convolutional Network method,
:::
- Mainly made of several innovations to make it faster
- Uses truncated Singular Value Decomposition (SVD) to speed up the computations,
- Uses a multi-task loss to train the whole network in one single stage (it jointly learns to classify object proposals (windows) and refine their spatial locations),
- Trains VGG16 9 times faster than R-CNN and 3 times faster than SPP-net,
- Is able to backpropagate the error into the convolutional layers (contrary to SPPnet and R-CNN), which increases the accuracy,
- No disk storage is required for feature caching.
![](https://i.imgur.com/GSOTYDB.png)
![](https://i.imgur.com/d7Vkumr.png)
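The SVD truncation trick can be sketched with NumPy (a pedagogical sketch; Fast R-CNN applies this to the large fully connected layers of the trained detector):

```python
import numpy as np

def truncate_fc_layer(W, k):
    """Replace one fully connected layer W (out x in) by two smaller
    ones via a rank-k truncated SVD: W ~ (U_k S_k) @ Vt_k.
    Parameter count drops from out*in to k*(out + in)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = Vt[:k, :]            # first new layer:  k x in
    W2 = U[:, :k] * S[:k]     # second new layer: out x k
    return W1, W2
```

For example, a $4096 \times 4096$ layer truncated to $k = 256$ keeps only $256 \times (4096 + 4096)$ weights, roughly an 8x reduction, at a small cost in accuracy.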
# Faster R-CNN (2016)
:::info
- Usual object detection methods depended on (slow) region proposal algorithms,
- They got the original idea to use ANN’s to do these predictions on GPU (much faster),
- They called this technology Region Proposal Networks (RPNs).
:::
## Properties
- The RPN is just made of several convolutional layers applied on the feature maps,
- It is thus a fully convolutional network (weights are shared in space),
- It is thus translation-invariant in space (contrary to the MultiBox method),
- It can be seen as a mini-network with a sliding window applied on the feature map to predict proposals,
- It predicts proposals (using regression) and objectness scores at the same time,
- It is able to predict proposals with a wide range of scales and aspect ratios (by default, 3 and 3 respectively).
:::success
Since the Fast R-CNN does not include a region proposal mechanism, they added their RPN in front of the Fast R-CNN to obtain the Faster R-CNN,
:::
- The RPN is thus an **attention network**, since it tells the Fast R-CNN where to look
- Since the efficiency of the Fast R-CNN depends on the region proposals, better proposals thanks to the RPN imply a better accuracy of the Faster R-CNN,
- To ensure that the features used by the RPN and the Fast R-CNN are the same, they share the weights of the feature extractor between them (faster, more accurate).
- The RPN predictions then take only 10 milliseconds to compute.
![](https://i.imgur.com/zoZ9vtf.png)
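The 3-scales x 3-ratios anchor scheme can be sketched as follows (the base size, scales and ratios follow the paper's defaults, but treat the exact numbers as illustrative):

```python
def generate_anchors(base_size=16, scales=(8, 16, 32),
                     ratios=(0.5, 1.0, 2.0)):
    """Return the 9 anchors (3 scales x 3 aspect ratios) attached to
    one sliding-window position, centred on (0, 0), as
    (x1, y1, x2, y2). `ratio` is interpreted as height / width."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            area = (base_size * scale) ** 2
            w = (area / ratio) ** 0.5   # keep the area fixed per scale
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors
```

The RPN then regresses offsets from each anchor and scores its objectness; shifting the same anchor set to every feature-map position is what makes the scheme translation-invariant.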
# Mask R-CNN (2018)
:::info
Extension of Faster R-CNN
:::
:::danger
aim is **instance segmentation**
:::
Has 3 outputs/predictions:
1. the usual bounding box predictions (from Faster R-CNN),
2. the usual classification predictions (still from Faster R-CNN),
3. the mask predictions (a small FCN applied to each RoI – NEW!!),
*No competition* is done among class predictions
- Mask prediction is done *in parallel*
- The training is done with a multi-task loss:
$$
Loss = \alpha_1L_{class} + \alpha_2L_{reg}+\alpha_3L_{mask}
$$
- We can easily change the backbone (feature extractor)
- It runs at 5 fps
![](https://i.imgur.com/BRmLTQK.png)
![](https://i.imgur.com/1xomgWc.png)
![](https://i.imgur.com/VXdyxTt.png)
# R-FCN Architectures (2016)
:::info
Region-based Fully Convolutional Networks
:::
- 2-stage object detection strategy
- Every layer is convolutional, whatever its role
- Almost all the computations are shared on the entire image
- RoIs (candidate regions) are extracted by a Region Proposal Network (RPN)
- Uses position-sensitive score maps
![](https://i.imgur.com/MkSJ4sw.png)
![](https://i.imgur.com/EoDV021.png)
![](https://i.imgur.com/GwQyIok.png)
The window is shifted to the right:
![](https://i.imgur.com/zfNAKy8.png)
> In the top-middle probability map, the white pixels correspond to a high probability that a head is present at that position
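The position-sensitive score map idea can be sketched in plain Python (a simplified integer-bin, average-pooling version; the real R-FCN pools fractional bins of a convolutional score map):

```python
def ps_roi_pool(score_maps, roi, k=3):
    """Position-sensitive RoI pooling: the RoI is split into a k x k
    grid, and bin (i, j) pools ONLY from the score map dedicated to
    that relative position (k*k maps per class)."""
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / k
    bin_h = (y2 - y1) / k
    out = []
    for i in range(k):            # bin row (top / middle / bottom)
        row = []
        for j in range(k):        # bin column (left / centre / right)
            m = score_maps[i * k + j]  # map specialised for this position
            ys = range(int(y1 + i * bin_h), int(y1 + (i + 1) * bin_h))
            xs = range(int(x1 + j * bin_w), int(x1 + (j + 1) * bin_w))
            vals = [m[y][x] for y in ys for x in xs]
            row.append(sum(vals) / len(vals) if vals else 0.0)
        out.append(row)
    return out
```

The class score of the RoI is then just the average of the $k \times k$ pooled values, so no per-RoI fully connected layers are needed — which is why almost all computation can be shared.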
# RetinaNet (2018)
- One-stage detector
- Uses an *innovative focal loss*
- Naturally handles *class imbalance*
- Uses a **Feature Pyramid Network** (FPN) backbone of ResNet architecture
- It thus provides a *rich* multi-scale feature pyramid (efficiency)
- At each scale, they attach *subnetworks* to classify and make regressions
![](https://i.imgur.com/Urr3Vbx.png)
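The focal loss can be written down directly for the binary case ($\alpha = 0.25$ and $\gamma = 2$ are the defaults reported in the paper):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    p is the predicted foreground probability, y in {0, 1} the label.
    Easy, well-classified examples (p_t close to 1) are strongly
    down-weighted, which is how the loss copes with the huge
    background/foreground imbalance of dense detectors."""
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)
```

With $\gamma = 0$ this reduces to the usual $\alpha$-weighted cross-entropy; increasing $\gamma$ shrinks the contribution of the many easy negatives.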
# Detectrons (2018-2019)
## Detectron V1 2018 (Facebook)
![](https://i.imgur.com/Hwu5RhV.png)
## Detectron V2 (Facebook)
![](https://i.imgur.com/GqBOC9W.png)
![](https://i.imgur.com/EEB6f2K.png)
![](https://i.imgur.com/cblq1WV.png)
# Real-time detection algorithms
## YOLO (You Only Look Once) (2016)
- single-shot detection architecture
- Designed for real-time applications
- It does NOT predict regions of interests
- It predicts a fixed amount of detections on the image directly,
- They are then filtered to contain only the actual detections.
- faster than region-based architectures
- lower detection accuracy
- performs a multi-box bounding box regression on the input image directly
- Method: the image is overlaid by a grid, and for each grid cell, a fixed amount of detections is predicted.
![](https://i.imgur.com/Np8qiEY.png)
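The grid-cell prediction scheme can be sketched as follows (a simplified YOLOv1-style decoding of one cell; the tuple layout is an assumption for illustration):

```python
def decode_cell(row, col, pred, grid=7):
    """Decode one grid cell's prediction (tx, ty, w, h, confidence):
    tx, ty locate the box centre inside the cell, while w, h are
    already relative to the whole image. Returns a box in [0, 1]
    image coordinates plus its confidence."""
    tx, ty, w, h, conf = pred
    cx = (col + tx) / grid   # cell offset + in-cell position
    cy = (row + ty) / grid
    return (cx, cy, w, h), conf
```

Running this for every cell (and every box slot per cell) yields the fixed set of detections, which is then filtered by confidence and NMS.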
## SSD (Single Shot Multibox Detector) (2016)
- Is a single-shot detection architecture
- Instead of performing bounding box regression on the final layer like YOLO, SSDs append additional convolutional layers that gradually decrease in size.
- For each additional layer, a fixed amount of predictions with diverse aspect ratios are computed,
- It results in a large number of predictions that differ heavily across size and aspect ratio.
![](https://i.imgur.com/ZQIztkw.png)
## YOLOv2 (YOLO 9000) (2016)
- Extension of YOLOv1
- Ability to predict objects at different resolutions,
- Computes the first bounding box predictions using clustering,
- Better performance than SSD.
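The clustering step can be sketched as follows (a naive k-means using the paper's $1 - \mathrm{IoU}$ distance on width/height pairs; initializing from the first $k$ boxes is a simplification):

```python
def iou_wh(a, b):
    """IoU of two boxes that share the same centre, given as (w, h)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=100):
    """Cluster ground-truth (w, h) pairs with distance = 1 - IoU to
    pick k bounding-box priors, as YOLOv2 does."""
    centroids = list(boxes[:k])  # naive init: first k boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[best].append(b)
        centroids = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids
```

Using $1 - \mathrm{IoU}$ rather than Euclidean distance avoids penalizing large boxes more than small ones, so the resulting priors fit the dataset's box shapes better.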