Face detection in general

Why is face detection so difficult?

Pose (Out-of-Plane Rotation) and orientation (In-Plane Rotation)
Presence or absence of structural components
Occlusions
Imaging conditions
Faces are highly non-rigid objects (deformations)

  • Face localization
  • Facial feature extraction (landmarks such as eyes, mouth, etc.)
  • Face recognition
  • Verification
  • Facial expression

Overview of different approaches:

  1. Knowledge-based top-down methods
  2. Feature invariant methods (localization)
  3. Template-matching methods (localization)
  4. Appearance-based methods (detection)

Appearance-based methods in detail

  • Eigenfaces
  • Distribution-based methods
  • Support Vector Machines (SVM)
  • Sparse Network of Winnows
  • Naive Bayes Classifier
  • Hidden Markov models

  • Information Theoretic Approaches (ITA)
  • Inductive Learning (C4.5 and Find-S algorithms)
  • Artificial Neural Network (ANN) techniques
    • Shallow networks (as opposed to deep ones)
    • Deep learning

Residual connections allow backpropagation to reach much deeper into the network than would otherwise be possible.

The big drawback of VGG networks: they have an enormous number of weights.

Weights = parameters (not hyperparameters).

Percentage-wise, there are fewer and fewer neurons.

The more weights a network has, the more likely it is to be powerful, but also the more likely that part of it serves no purpose.

The beginning in 1994

Burel and Carel propose a methodology for ANNs:

  1. The training phase, during which the system tunes its internal parameters
  2. The local training phase, during which the system adapts the weights to the environment of a local site
  3. The detection phase, during which the weights no longer change

Vaillant, Monrocq and Le Cun: first translation-invariant ANN; it decides whether each pixel belongs to a given object or not

Yang and Huang: first fully automatic human face recognition system

1997

Rowley, Baluja and Kanade propose the first rotation-invariant method:

  • Uses template-based approach
  • Methodology:
    • Regions are proposed
    • A router network estimates the orientation of this region
    • The window is de-rotated using this angle
    • A detector network decides if the window contains a face

2004

First real-time face detection algorithm by Viola & Jones

  • Tells if a given image of arbitrary size contains a human face, and if so, where it is
  • Minimizes false positive and false negative rates
  • Usually 5 types of Haar-like features

  • A 24×24 image contains a huge number of features (162,886)
  • Integral image for feature computation

  • A = 1, B = 2, C = 3, etc.

Allows a low computational cost of features
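To sketch why the cost stays low: each entry of the integral image stores the sum of all pixels above and to its left, so any rectangle sum (and any Haar-like feature, which is a difference of rectangle sums) needs only four lookups. A minimal numpy version (function names are illustrative):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img over all pixels above and to the left of (y, x), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, y, x, h, w):
    """Sum of the h x w rectangle with top-left corner (y, x), in at most 4 lookups."""
    total = ii[y + h - 1, x + w - 1]
    if y > 0:
        total -= ii[y - 1, x + w - 1]
    if x > 0:
        total -= ii[y + h - 1, x - 1]
    if y > 0 and x > 0:
        total += ii[y - 1, x - 1]
    return total

# A two-rectangle Haar-like feature is then just the difference of two rect_sum calls.
```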

Principle

The algorithm should deploy more resources to work on those windows more likely to contain
a face while spending as little effort as possible on the rest

  • We can use weak classifiers
  • Then we can make a strong one with a sequence of weak ones
  • Viola & Jones: use AdaBoost

The more layers, the fewer false positives.
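The cascade principle above can be sketched as follows: stages are evaluated in order and a window is rejected at the first failing stage, so almost no effort is spent on easy negatives. The stage scorers and thresholds here are toy placeholders, not the actual Viola & Jones stages:

```python
def cascade_predict(stage_scores, thresholds, window):
    """Evaluate cascade stages in order; reject at the first stage whose
    score falls below its threshold (most non-face windows exit early)."""
    for score_fn, thr in zip(stage_scores, thresholds):
        if score_fn(window) < thr:
            return False  # rejected: later (more expensive) stages never run
    return True  # survived every stage: likely a face
```

Only the rare windows that survive every stage pay the full computational price.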

Overfeat (2014)

  • Winner of the ImageNet Large Scale Visual Recognition Challenge of 2013
  • Performs classification (blocks), localization (grouping blocks) and detection (merging windows) at the same time
  • This multitask approach boosts the performance of the network
  • Trained on ImageNet 2012
  • Uses multi-view voting
  • Uses a multiscale factor of 1.4
  • Uses dense sliding windows thanks to convolution

The better aligned the network window and the object, the stronger the confidence of the network response.

  • Efficiency: convolution computations in overlapping regions are shared
  • Bounding boxes are accumulated instead of suppressed
  • Only one shared network for 3 functionalities
  • Uses a feature extractor for classification purpose
  • Uses offsets to refine the resolution of the proposed windows
  • Detection fine-tuning: negative training on the fly
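For contrast with "bounding boxes are accumulated instead of suppressed": a minimal sketch of the usual Non-Maximal Suppression that OverFeat replaces by box merging (the `[x1, y1, x2, y2]` box format and function names are assumptions for the example):

```python
def iou(a, b):
    """Intersection-over-Union of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thr=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thr]
    return keep
```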

Methodology

  • decomposition into blocks with 3 offsets
  • for each block, estimation of the most probable corresponding class
  • (overlapping) region proposals for each class (see below)
  • bounding box deduction for each class (see below)
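A toy numpy sketch of the "dense sliding windows thanks to convolution" idea: scoring every window with a linear template is exactly one cross-correlation pass. The explicit loop below is for clarity; in a ConvNet the same map comes out of a single convolutional layer, which shares computation between overlapping windows (names are illustrative):

```python
import numpy as np

def dense_scores(image, template):
    """Score EVERY template-sized window of `image` with a dot product
    against `template`: one score per sliding-window position."""
    th, tw = template.shape
    H, W = image.shape
    out = np.empty((H - th + 1, W - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y+th, x:x+tw] * template)
    return out
```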

The MTCNN face detection algorithm (2016)

Zhang, Zhang & Li

  • Real-time deep-learning-based face detection algorithm
  • The MTCNN is a cascade of 3 similar networks (P/R/O-nets)
  • The four steps:
    1. Computation of the (multiscale) image pyramid
    2. P-net: proposal network
    3. R-net: refinement net (filters and refines the results of the P-Net)
    4. O-net: output network (refines further, and proposes landmarks)

  • Uses hard sample mining (the 30% easiest cases do not take part in the backpropagation) to improve the detection results
  • Originality: uses multi-task learning, that is, every network
    • predicts bounding boxes
    • uses regression to refine/calibrate the position of the edges of the bounding box
    • applies Non-Maximal Suppression (NMS) to keep only relevant candidate windows (merging highly overlapping candidates)
    • (can) propose 5 facial landmarks
  • This multi-task learning seems to improve face detection compared to usual mono-task learning
  • How does it work in practice? It minimizes:

$\text{Loss} = \alpha_1 L_{\text{detection}} + \alpha_2 L_{\text{regression}} + \alpha_3 L_{\text{landmarks}}$

where the first term is based on cross-entropy, and the other two on a Euclidean loss
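The loss above can be sketched in numpy as follows; the weights `a1`–`a3` and the function names are illustrative defaults, not values from the paper:

```python
import numpy as np

def multi_task_loss(p_face, is_face, box_pred, box_true, lmk_pred, lmk_true,
                    a1=1.0, a2=0.5, a3=0.5):
    """Weighted sum of the three losses: cross-entropy for the face/non-face
    decision, squared Euclidean distances for the bounding-box regression
    and for the facial landmark positions."""
    eps = 1e-12  # numerical safety for the logarithms
    l_det = -(is_face * np.log(p_face + eps)
              + (1 - is_face) * np.log(1 - p_face + eps))
    l_reg = np.sum((box_pred - box_true) ** 2)
    l_lmk = np.sum((lmk_pred - lmk_true) ** 2)
    return a1 * l_det + a2 * l_reg + a3 * l_lmk
```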

Fast R-CNN and its predecessors (2014-2015)

Spatial-Pyramid Pool network (2014)

  • Was proposed to speed up the R-CNN by sharing computation,
  • The SPPnet computes a shared feature map using convolutions over the entire image, and only then extracts the features corresponding to each proposal to make the prediction,
  • It then concatenates the features of the proposal coming from each scale thanks to MaxPooling into a 6×6×#scales map (spatial pyramid).
  • SPP-nets accelerate the R-CNN by 10 to 100 times at test time and by 3 times at training time.
  • Drawback 1: Like the R-CNN, it is a multi-stage approach:
    • First, feature extraction using convolution,
    • Second, fine-tuning of a network using log loss,
    • Third, SVM training,
    • Fourth, fitting bounding-box regressors.
  • Drawback 2: Features are written to disk,
  • The fine-tuning cannot update the convolutional layers that precede the spatial pyramid pooling (limited accuracy).
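A minimal sketch of the spatial pyramid pooling step: a proposal's feature map of any size is max-pooled into several fixed grids and the results are concatenated, so the output length is constant and can feed fixed-size FC layers. The `levels` here are illustrative (the note mentions a 6×6 grid per scale):

```python
import numpy as np

def spatial_pyramid_pool(feat, levels=(1, 2, 4)):
    """Max-pool one proposal's (cropped) feature map into fixed n x n grids
    and concatenate: whatever the input size, the output has
    sum(n * n for n in levels) values."""
    H, W = feat.shape
    pooled = []
    for n in levels:
        ys = np.linspace(0, H, n + 1).astype(int)
        xs = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                pooled.append(feat[ys[i]:ys[i+1], xs[j]:xs[j+1]].max())
    return np.array(pooled)
```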

Fast R-CNN (2015)

a Fast Region-based Convolutional Network method,

  • Mainly made of several innovations to make it faster,
  • Uses Singular Value Decomposition (SVD) truncation to speed up the computations,
  • Uses a multi-task loss to train the whole network in one single stage (it jointly learns to classify object proposals (windows) and refine their spatial locations),
  • Trains the VGG16 9 times faster than the R-CNN and 3 times faster than the SPP-nets,
  • Is able to backpropagate the error into the convolutional layers (contrary to SPPnets and the R-CNN), which increases accuracy,
  • No disk storage is required for feature caching.
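The SVD truncation mentioned above can be sketched as follows: a large fully connected weight matrix is replaced by two thin layers, cutting the multiply cost from m·n to k·(m + n). A numpy sketch (function name illustrative):

```python
import numpy as np

def truncate_fc(W, k):
    """Approximate an m x n FC weight matrix W by two thin layers
    A (m x k) and B (k x n) with W ~ A @ B, keeping the top-k
    singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * S[:k]   # absorb the top-k singular values into A
    B = Vt[:k, :]
    return A, B
```

When W has (numerical) rank at most k, the factorization is exact; otherwise it is the best rank-k approximation.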

Faster R-CNN (2016)

  • Usual object detection methods depended on (slow) region proposal algorithms,
  • They got the original idea to use ANNs to make these predictions on GPU (much faster),
  • They called this technology Region Proposal Networks (RPNs).

Properties

  • It is just made of several convolutional layers applied on the feature maps,
  • It is therefore a fully convolutional network (weights are shared in space),
  • It is therefore translation-invariant in space (contrary to the MultiBox method),
  • It can be seen as a mini-network with a sliding window applied on the feature map to predict proposals,
  • It predicts proposals (using regression) and objectness scores at the same time,
  • It is able to predict proposals with a wide range of scales and aspect ratios (by default, 3 of each).
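With 3 scales and 3 aspect ratios, each sliding position gets 9 anchors. A sketch of the anchor shapes (the `base` and `scales` values are illustrative defaults, not necessarily the paper's):

```python
def make_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """One (w, h) anchor per scale/aspect-ratio pair at a sliding position:
    the area is fixed by the scale, and the ratio r = h / w shapes the box."""
    anchors = []
    for s in scales:
        area = float(base * s) ** 2
        for r in ratios:
            w = (area / r) ** 0.5
            h = w * r
            anchors.append((w, h))
    return anchors
```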

Since the Fast R-CNN does not have region proposals, they added their RPN in front of the Fast R-CNN to obtain the Faster R-CNN,

  • The RPN is then an attention network, since it tells the Fast R-CNN where to look,
  • Since the efficiency of the Fast R-CNN depends on the region proposals, better proposals thanks to the RPN imply a better accuracy of the Faster R-CNN,
  • To ensure that the features used by the RPN and the Fast R-CNN are the same, they shared the weights of the feature extractor between them (faster, more accurate).
  • Computing the predictions of the RPN then takes only 10 milliseconds.

Mask R-CNN (2018)

Extension of Faster R-CNN

Its aim is instance segmentation.

It has 3 outputs/predictions:

  1. the usual bounding box predictions (from Faster R-CNN),
  2. the usual classification predictions (still from Faster R-CNN),
  3. the mask predictions (a small FCN applied to each RoI – NEW!),

No competition is done among class predictions

  • Mask prediction is done in parallel
  • The training is done with a multi-task loss:

$\text{Loss} = \alpha_1 L_{\text{class}} + \alpha_2 L_{\text{reg}} + \alpha_3 L_{\text{mask}}$

  • We can easily change the backbone (feature extractor)
  • It runs at 5 fps
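The "no competition among class predictions" point can be sketched: each class has its own mask map with an independent per-pixel sigmoid, and the binary cross-entropy is computed only on the map of the ground-truth class, so class maps never compete as they would under a per-pixel softmax. A numpy sketch (names illustrative):

```python
import numpy as np

def mask_loss(mask_logits, target_mask, cls):
    """Per-pixel binary cross-entropy on the mask map of the ground-truth
    class `cls` only; each class map has its own independent sigmoid."""
    p = 1.0 / (1.0 + np.exp(-mask_logits[cls]))
    eps = 1e-12
    bce = -(target_mask * np.log(p + eps)
            + (1 - target_mask) * np.log(1 - p + eps))
    return bce.mean()
```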

R-FCN Architectures (2016)

Region-based Fully Convolutional Networks

  • 2-stage object detection strategy
  • Every layer is convolutional, whatever its role
  • Almost all the computations are shared on the entire image
  • RoIs (candidate regions) are extracted by a Region Proposal Network (RPN)
  • Uses position-sensitive score maps

We shift the window to the right:

In the top-middle probability map, the white pixels correspond to a high probability that a head is present at that location.
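The position-sensitive score maps can be sketched as follows: the RoI is cut into a k×k grid, and bin (i, j) is average-pooled from its own dedicated score map, so each map specializes in one relative part (e.g. "top-left of a head"); averaging the k×k votes scores the RoI. A numpy sketch with an assumed `(y0, x0, y1, x1)` RoI format:

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    """Position-sensitive RoI pooling: bin (i, j) of the k x k grid is
    average-pooled from score map number i*k + j only."""
    y0, x0, y1, x1 = roi
    ys = np.linspace(y0, y1, k + 1).astype(int)
    xs = np.linspace(x0, x1, k + 1).astype(int)
    votes = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            votes[i, j] = score_maps[i * k + j][ys[i]:ys[i+1], xs[j]:xs[j+1]].mean()
    return votes  # e.g. votes.mean() gives the class score for this RoI
```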

RetinaNet (2018)

  • One-stage detector
  • Uses an innovative focal loss
  • Naturally handles class imbalance
  • Uses a Feature Pyramid Network (FPN) backbone on top of a ResNet architecture
  • It thus provides a rich multi-scale feature pyramid (efficiency)
  • At each scale, subnetworks are attached for classification and box regression
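The focal loss handles class imbalance by down-weighting well-classified examples, so the huge number of easy background anchors stops dominating training. A numpy sketch of the binary form, with the commonly used default hyperparameters:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t):
    the (1 - p_t)**gamma factor shrinks the loss of examples the model
    already classifies well (p_t close to 1)."""
    eps = 1e-12
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t + eps)
```

With gamma = 0 this reduces to an alpha-weighted cross-entropy.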

Detectrons (2018-2019)

Detectron V1 2018 (Facebook)

Detectron V2 (Facebook)

Real-time detection algorithms

YOLO (You Only Look Once) (2016)

  • single-shot detection architecture
    • Designed for real-time applications
    • It does NOT predict regions of interests
    • It predicts a fixed number of detections on the image directly,
    • They are then filtered to contain only the actual detections.
  • faster than region-based architectures
  • lower detection accuracy
  • performs a multi-box bounding box regression on the input image directly
  • Method: the image is overlaid with a grid, and for each grid cell, a fixed number of detections is predicted.

SSD (Single Shot Multibox Detector) (2016)

  • Is a single-shot detection architecture
  • Instead of performing bounding box regression on the final layer like YOLO, SSDs append additional convolutional layers that gradually decrease in size.
  • For each additional layer, a fixed number of predictions with diverse aspect ratios is computed,
  • It results in a large number of predictions that differ heavily across size and aspect ratio.

YOLOv2 (YOLO 9000) (2016)

  • Extension of YOLOv1
  • Ability to predict objects at different resolutions,
  • Computes its initial bounding box priors using clustering,
  • Better performance than SSD.
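The clustering of bounding box priors can be sketched as k-means on (w, h) pairs with 1 − IoU as the distance, so that box shape rather than absolute size drives the grouping. A numpy sketch with a deterministic initialization on the first k boxes (for brevity; the choice of init is an assumption here):

```python
import numpy as np

def iou_wh(box, clusters):
    """IoU between one (w, h) box and k cluster centroids, with all boxes
    aligned on the same corner (only sizes matter for priors)."""
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=20):
    """k-means on (w, h) pairs using 1 - IoU as the distance."""
    clusters = boxes[:k].astype(float).copy()
    for _ in range(iters):
        assign = np.array([np.argmax(iou_wh(b, clusters)) for b in boxes])
        for c in range(k):
            members = boxes[assign == c]
            if len(members):
                clusters[c] = members.mean(axis=0)
    return clusters
```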