# Face Detection
###### tags: `Research`
We need a fast algorithm to find faces in images coming from the camera in real time. If we ever decide to expand the scope to unlock the door when a certain face shows up, then we'll need this feature anyway. A possible implementation strategy for that would be to detect the presence of a face on the FPGA (since that needs to be done in real time), but determine whether that face is authorized on the backend server (possibly using a library), since that step does not have to be fast.
Viola-Jones (below) is a machine learning algorithm, but does not use a neural network.
## Viola-Jones
[P. Viola and M. Jones, 2001](https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-01.pdf)
Recent-ish (2001) method for detecting objects very quickly. Good for finding faces in video in real time. Some newer algorithms (based on CNNs) claim to be able to detect faces faster, but are more complex and heavily memory bound, and so not as suitable for implementation on FPGAs.

### Features
Viola-Jones works by overlaying a (small, often 24x24) window on the image and evaluating how likely that window is to contain a face.
The classifier works by identifying rectangular **features**:

A feature is evaluated by subtracting the sum of the pixels in the white rectangles from the sum of the pixels in the grey rectangles. The three types of features the original VJ paper uses are shown here: two-rectangle, three-rectangle, and four-rectangle.
### Integral Images
We can speed up computing the value of a feature by converting our grayscale image into an [**integral image**](https://en.wikipedia.org/wiki/Summed-area_table). The integral image has the same resolution, but every "pixel" is the sum of the brightnesses of all pixels in the rectangle from the image's top-left corner down to that position.

More formally, given a greyscale image $i(x,y)$, the integral image $ii(x,y)$ is defined as:
$$ii(x,y) = \sum_{x'\leq x, y'\leq y} i(x',y')$$
We can compute the sum of any rectangular section of the image with only four lookups into the integral image, which can be pre-computed in one pass.

The sum of the pixels in rectangle D is:
$$D = p_4 + p_1 - (p_2 + p_3)$$
where $p_1$, $p_2$, $p_3$, and $p_4$ are the integral image values at the top-left, top-right, bottom-left, and bottom-right corners of D, respectively.
Computing the integral image is a simple dynamic-programming pass (sketched below). TODO: think about how to do it on the FPGA.
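A minimal software sketch of both steps, in plain Python over a 2-D list of grayscale values (the padding row/column of zeros is just a convenience to avoid bounds checks, not part of the definition):
```python
def build_integral(img):
    """Build the integral image in one dynamic-programming pass:
    ii[y][x] holds the sum of img over the rectangle above and left.
    Recurrence: ii = pixel + left + above - above-left."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (img[y][x] + ii[y + 1][x]
                                + ii[y][x + 1] - ii[y][x])
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle with top-left pixel (x, y), using
    the four-corner formula D = p4 + p1 - (p2 + p3)."""
    return ii[y + h][x + w] + ii[y][x] - ii[y][x + w] - ii[y + h][x]
```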
### Classifiers
A complete model gives us a series of features $f_j$, $j = 1 \ldots n$ (each of one of the three types), as well as a weight $\alpha_j$, threshold $\theta_j$, and parity $p_j$ for each. Each feature $j$ has a weak classifier $h_j$ that takes an image window $x$:
$$h_j(x) = \begin{cases}
\alpha_j & p_j f_j(x) < p_j \theta_j \\
0 & \textrm{otherwise}
\end{cases}$$
In other words, the classifier is active if the feature evaluates to a value below the threshold (or above, if the parity is flipped), and contributes a value of $\alpha_j$. The full classifier is simply the sum of all the weak classifiers (a weighted sum of all the features), compared against an overall threshold.
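As a concrete reading of the formula, a minimal sketch in Python (the parameter names mirror the math above; each feature is assumed to be a callable returning $f_j(x)$ for a window $x$):
```python
def weak_classifier(f_j, alpha_j, theta_j, p_j, x):
    """h_j(x) = alpha_j if p_j * f_j(x) < p_j * theta_j, else 0."""
    return alpha_j if p_j * f_j(x) < p_j * theta_j else 0

def strong_classifier(weak_params, x, threshold):
    """Sum the weak classifiers (a weighted vote over all features)
    and compare against the classifier's threshold."""
    total = sum(weak_classifier(f, a, t, p, x)
                for (f, a, t, p) in weak_params)
    return total >= threshold
```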
[Example implementation from OpenCV](https://github.com/opencv/opencv/blob/master/apps/traincascade/haarfeatures.h#L73) (good if we'd like to use their models). I think they're using a slightly different weight setup (two weights, one each for the positive and negative rectangles).
### Attentional Cascade
The model described in the original paper has roughly 6,000 features. Evaluating all of them would be fairly slow, especially if you want to slide the window across a large image one pixel at a time. It is sped up greatly using the **attentional cascade**: a series of simpler classifiers that we use to quickly reject windows that almost certainly don't contain a face.
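The control flow is simple: a window survives only if every stage accepts it. A rough sketch, reusing `strong_classifier` from above (`stages` as a list of `(weak_params, threshold)` pairs is our own representation):
```python
def cascade_accepts(stages, window):
    """Run the attentional cascade on one window. Early stages are
    cheap, so most non-face windows exit after a few features."""
    for weak_params, threshold in stages:
        if not strong_classifier(weak_params, window, threshold):
            return False  # rejected: skip all later (pricier) stages
    return True  # survived every stage: report a face
```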

The OpenCV data has 25 stages. Here's how they differ in complexity, for comparison. (You can also see how stage 0 quickly identifies large, obvious features like that the eye region is darker than its surroundings, the bridge of the nose is lighter, etc.)
The authors also argue that because simple classifiers (they use a two-feature first stage) are so effective at rejecting large parts of the image, the cascade is much faster than neural-network techniques, which would have to evaluate at minimum a one-layer perceptron for every window.
Stage 0:

Stage 24:

### Scanning
At each step of the scanning process, we shift the window over by some step size (in pixels) $\Delta$. In the paper, they tried both $\Delta = 1.0$ and $\Delta = 1.5$; at detector scale $s$, the actual shift is $s\Delta$ rounded, which is why a non-integer $\Delta$ makes sense.
To detect faces at differing scales, they scan at a series of scales a factor of 1.25 apart. They scale the detector rather than the image, so that the integral image only needs to be computed once.
However, the detector may fire several times for the same face at nearby positions and scales. We account for this by combining all overlapping detections and drawing a box around them, giving our final bounding box for the face.
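A rough sketch of the scan loop and the merging step (plain Python; `cascade_accepts` is the sketch from earlier, boxes are `(x0, y0, x1, y1)` tuples, and all names here are our own, not from the paper):
```python
def scan(stages, ii, img_w, img_h, base=24, delta=1.5, scale_step=1.25):
    """Slide the detector across positions and scales, collecting hits.
    The integral image ii is computed once; only the window grows."""
    detections = []
    s = 1.0
    while base * s <= min(img_w, img_h):
        size = round(base * s)
        step = max(1, round(s * delta))  # the shift grows with the scale
        for y in range(0, img_h - size + 1, step):
            for x in range(0, img_w - size + 1, step):
                if cascade_accepts(stages, (ii, x, y, size)):
                    detections.append((x, y, x + size, y + size))
        s *= scale_step
    return detections

def merge_overlapping(boxes):
    """Greedily combine overlapping detections into one box per face.
    One pass can leave chains unmerged; repeat until stable if needed."""
    merged = []
    for b in boxes:
        for i, m in enumerate(merged):
            if b[0] < m[2] and m[0] < b[2] and b[1] < m[3] and m[1] < b[3]:
                merged[i] = (min(b[0], m[0]), min(b[1], m[1]),
                             max(b[2], m[2]), max(b[3], m[3]))
                break
        else:
            merged.append(b)
    return merged
```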
### Training
TODO: Put some cursory information about how training is done for this, even though we likely won't have to do it. (Short version: AdaBoost re-weights the training examples each round and greedily picks the single feature/threshold that classifies them best as the next weak classifier.)
Available pre-trained models:
- [OpenCV faces, frontal, 24x24](https://github.com/opencv/opencv/blob/master/data/haarcascades/haarcascade_frontalface_default.xml)
#### OpenCV Classifier
This section is meant to summarize how to use the OpenCV model. Unhelpful tags are omitted.
Notes:
The normalization rectangle is inset 1 pixel from the classifier window on every side (22x22 for a 24x24 classifier). If $s$ is the sum of the pixels in the normalization rectangle, $r$ is the sum of their squares, and $N$ is the total number of pixels, then the normalization factor $f$ is:
$$f = \frac{1}{\sqrt{N r - s^2}}$$
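With a second integral image built over the squared pixel values, $s$ and $r$ are four lookups each, so $f$ is cheap to compute per window. A sketch reusing `rect_sum` from earlier (the squared table `sq_ii` is our assumption for how $r$ is obtained):
```python
import math

def norm_factor(ii, sq_ii, x, y, w, h):
    """f = 1 / sqrt(N*r - s^2), where s is the pixel sum and r the
    sum of squared pixels over the normalization rectangle."""
    n = w * h
    s = rect_sum(ii, x, y, w, h)
    r = rect_sum(sq_ii, x, y, w, h)
    return 1.0 / math.sqrt(n * r - s * s)
```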
The rough structure of the model file is as follows:
```xml
<opencv_storage>
<cascade type_id="opencv-cascade-classifier">
<stageType>BOOST</stageType>
<featureType>HAAR</featureType>
<height>24</height>
<width>24</width>
<stages>
(stages)
</stages>
<features>
(features)
</features>
</cascade>
</opencv_storage>
```
The `BOOST` type indicates that this is a boosted tree classifier (the kind described earlier). We are using `HAAR` features (adjacent rectangular locations, as described earlier). We are given the width/height of the detector to be scanned across the image.
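To get a feel for the file, a quick parsing sketch with Python's standard `xml.etree.ElementTree` (the tag paths follow the structure above; we haven't verified them against the real file, so treat them as assumptions):
```python
import xml.etree.ElementTree as ET

tree = ET.parse("haarcascade_frontalface_default.xml")
cascade = tree.getroot().find("cascade")

width = int(cascade.findtext("width"))    # 24 for the default model
height = int(cascade.findtext("height"))  # 24
stages = cascade.find("stages")
features = cascade.find("features")
print(width, height, len(stages), len(features))  # child counts
```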
The features list contains elements of the following form:
```xml
<_>
<rects>
<_> x y width height weight </_> ...
</rects>
</_>
```
`x`, `y`, `width`, and `height` are all integers that specify the location of a rectangle in the detector region. `weight` is a float (though they are all conveniently integers in the default model) that is multiplied by the sum of the pixels in the rectangle to compute the value. For example, a Haar feature with one positive region and one negative would have two rectangles with weights `+1` and `-1`.
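So a feature's value is just a weighted sum of rectangle sums. A sketch reusing `rect_sum` from earlier, with `rects` as `(x, y, w, h, weight)` tuples parsed from the XML (our own representation):
```python
def eval_feature(ii, rects, wx, wy):
    """Value of one Haar feature for the window at (wx, wy): the sum,
    over its rectangles, of weight * (pixel sum in that rectangle)."""
    return sum(weight * rect_sum(ii, wx + x, wy + y, w, h)
               for (x, y, w, h, weight) in rects)
```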
The sum is normalized, presumably by the factor $f$ from the notes above. TODO: Find out exactly how.
The stages list contains elements of this form:
```xml
<_>
<stageThreshold> threshold </stageThreshold>
<weakClassifiers>
<_>
<internalNodes> lidx ridx feature_idx weak_threshold </internalNodes>
<leafValues> left_value right_value </leafValues>
</_>
</weakClassifiers>
</_>
```
OpenCV provides tree-based weak classifiers (which we will probably not be using) as well as stump classifiers, which are closer to the original paper but not quite the same.
A stage is evaluated by traversing the weak classifiers (possibly in parallel), and evaluating the feature at `feature_idx`. If it is less than the `weak_threshold`, we add `left_value` to the sum, otherwise `right_value`.
If the total sum is greater than the `threshold`, the detector stage fires.
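Putting the pieces together, a sketch of stump-only stage evaluation (`features[i]` is the rect list for feature `i`, `stumps` holds `(feature_idx, weak_threshold, left_value, right_value)` tuples; normalization is omitted pending the TODO above):
```python
def eval_stage(ii, features, stumps, stage_threshold, wx, wy):
    """Sum the stump outputs and compare against the stage threshold."""
    total = 0.0
    for feature_idx, weak_threshold, left_value, right_value in stumps:
        value = eval_feature(ii, features[feature_idx], wx, wy)
        total += left_value if value < weak_threshold else right_value
    return total > stage_threshold  # stage fires: window survives
```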
## Implementation on an FPGA
[P. Irgens et al, 2017](https://www.sciencedirect.com/science/article/pii/S2468067216300116?via%3Dihub)