# [Paper Reading] Deep Learning for Understanding Faces

Machines may be just as good, or better, than humans.
[R. Ranjan et al., Jan. 2018](https://ieeexplore.ieee.org/document/8253595)

## Overview of an Automatic Face Identification System

Three modules are typically needed for such an automatic system.

1. A face detector localizes faces in images or videos.
2. A fiducial point detector localizes the important facial landmarks.
3. A feature descriptor that encodes the identity information is extracted from the aligned face.

A large number of annotated unconstrained face data sets is a key to DCNNs' success.

## Face Detection in Unconstrained Images

* Region based
    * A pool of candidate face regions is proposed by a generic object detector, and these regions are then classified by a DCNN.
    * Drawbacks:
        * Difficult faces are often missed by the object detector that generates the proposals.
        * The two-stage pipeline increases computation time.
* Sliding-window based
    * Detection at different scales is typically carried out by building an image pyramid and running the detector on every level (see the pyramid sketch in the code sketches at the end of these notes).
* End-to-end
    * The single-shot detector (SSD) is a sliding-window-based detector, but it handles different scales by pooling intermediate layers at different resolutions. Naturally, SSD can operate faster than a sliding-window detector on an image pyramid.

## Fiducial Point Detection

Fiducial point detectors locate keypoints on a face and align the face into canonical coordinates. Good alignment can greatly improve face recognition performance.

* Model based
    * 3-D approaches perform better than 2-D approaches.
* Cascaded-regression based
    * Performance depends on the robustness of the local descriptors.

## Face Identification and Verification

![](https://i.imgur.com/Zzq3S74.png)

Two major components:

1. Robust face representation
    * Yang et al. proposed a neural aggregation network (NAN) that performs dynamically weighted aggregation of the features from multiple face images or video frames, yielding a succinct and robust representation for video face recognition (a minimal sketch of this kind of pooling appears in the code sketches below).
2. Discriminative classification models or similarity measures
    * The most common similarity metrics are:
        * L2 distance
        * cosine similarity

DCNN models trained on a combination of still images and video frames perform better than those trained on only one type of data. For smaller models, training on wider data sets (more subjects) works better, while for deeper models, training on deeper data sets (more images per subject) works better.

## Data Sets

Images in IJB-A contain extreme pose, illumination, and expression variations. These factors make IJB-A a challenging face recognition data set.

## Multitask Learning

Goodfellow et al. interpret MTL as a regularization method for DCNNs.

## Open Issues

* Due to memory constraints, how to choose informative pairs or triplets and train end to end on large-scale data sets is still an open problem.
* How can full-motion video processing be incorporated into deep networks to enable video-based face analytics?
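
## Code Sketches

The sections above stay at the survey level, so here are a few short Python sketches that make the ideas concrete. First, the image pyramid used by sliding-window detectors: `detect_single_scale` is a hypothetical fixed-size detector, and the scale factor and minimum size are illustrative values, not taken from the paper.

```python
import cv2
import numpy as np

def image_pyramid(image, scale_factor=0.8, min_size=48):
    """Yield (scale, resized_image) pairs until the image becomes too small.

    scale_factor and min_size are illustrative values.
    """
    scale = 1.0
    while min(image.shape[:2]) * scale >= min_size:
        resized = cv2.resize(image, None, fx=scale, fy=scale)
        yield scale, resized
        scale *= scale_factor

def detect_multi_scale(image, detect_single_scale):
    """Run a fixed-size detector on every pyramid level.

    detect_single_scale is a hypothetical callable returning boxes as
    (x1, y1, x2, y2) in the resized image's coordinates.
    """
    all_boxes = []
    for scale, resized in image_pyramid(image):
        for (x1, y1, x2, y2) in detect_single_scale(resized):
            # Map the box back to the original image's coordinates.
            all_boxes.append((x1 / scale, y1 / scale, x2 / scale, y2 / scale))
    return all_boxes
```

In practice the pooled boxes would still go through non-maximum suppression before being reported as detections.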
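For the alignment step mentioned in the fiducial point section, one common recipe (not spelled out in these notes) is to fit a least-squares similarity transform from the detected landmarks to a canonical template, in the style of Umeyama's method. The NumPy sketch below assumes the canonical template coordinates come from elsewhere.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares similarity transform (scale, rotation, translation)
    mapping src landmarks onto dst landmarks; both are (N, 2) arrays."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)          # 2x2 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))        # guard against reflections
    D = np.diag([1.0, d])
    R = U @ D @ Vt                            # rotation
    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = dst_mean - scale * R @ src_mean       # translation
    return np.hstack([scale * R, t[:, None]]) # 2x3 affine matrix
```

The returned 2x3 matrix can then be applied with an affine warp (for example `cv2.warpAffine`) to produce the aligned face crop.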
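The NAN summary above only says that the aggregation is dynamically weighted. A minimal sketch of that kind of pooling, assuming a single query vector `q` (the real NAN stacks attention blocks and learns the weights end to end), could look like this:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate_features(frame_features, q):
    """Weighted aggregation of per-frame features into one descriptor.

    frame_features: (T, D) array of DCNN features from T frames.
    q: (D,) query vector; treated here as a given placeholder, whereas
    NAN learns it jointly with the network.
    """
    scores = frame_features @ q       # one relevance score per frame
    weights = softmax(scores)         # frames with higher scores dominate
    return weights @ frame_features   # (D,) aggregated representation
```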
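The two similarity metrics listed under face verification are straightforward to state in code. The verification threshold below is purely illustrative and would be tuned on a validation set.

```python
import numpy as np

def l2_distance(a, b):
    """Euclidean distance between two face embeddings."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine of the angle between two face embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(a, b, threshold=0.5):
    """Declare a match if the similarity exceeds an illustrative threshold."""
    return cosine_similarity(a, b) >= threshold
```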
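Finally, the open issue about choosing informative pairs or triplets refers to metric-learning objectives such as the triplet loss. A minimal sketch with hardest-in-batch negative mining follows; the margin value is illustrative, and this mining strategy is just one of many.

```python
import numpy as np

def triplet_loss(anchor, positive, negatives, margin=0.2):
    """Triplet loss using the hardest negative within a batch.

    anchor, positive: (D,) embeddings of the same identity.
    negatives: (N, D) embeddings of other identities.
    margin is an illustrative value.
    """
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negatives) ** 2, axis=1)
    hardest = d_an.min()  # closest (most confusing) negative
    return max(0.0, d_ap - hardest + margin)
```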