# AISS-CV Group 1: Perfect group images using a two-step smile detection algorithm

In this document, we describe our project progress and the decisions taken during development. It is divided into the six phases of the Cross Industry Standard Process for Data Mining (CRISP-DM) model.

## Table of Contents

- [Business Understanding](#business-understanding)
- [Data Understanding](#data-understanding)
- [Data Preparation](#data-preparation)
- [Modeling](#modeling)
- [Evaluation](#evaluation)
- [Deployment](#deployment)

## Business Understanding

The problem with group photos is that they are hard to coordinate and usually rely on timer-based methods (for example: "everyone say cheese"). These methods do not guarantee a good outcome, as they rely entirely on the coordination of the participants as well as on the reaction time of the photographer. Another major problem comes with the current corona situation: to reduce the risk of infection, devices that can be used hands-free become even more desirable.

Therefore, we introduce our product: a fully automatic photobox that activates when enough people are smiling. A machine learning approach is a good way to address these problems from the start. It could enable an automated shutter release that can guarantee the desired outcome, based on face and smile detection algorithms. A good implementation could lead to less time wasted on coordination and, in general, better pictures, mostly because human failure due to reaction time is drastically reduced. Furthermore, showcasing the interesting technology behind the device could generate additional value, because a modern technical background can make the product itself more appealing.

Furthermore, our product can be used in public spaces to provide a "digital postcard" service to, e.g., tourists. The photobox will be mounted at landmarks, where the group pictures are taken and subsequently sent to the user via email.

Another way to look at the possibilities of such a product is the business case. On the one hand, one can think of an offer like Software as a Service (SaaS) or an app deployment. In this case we would only rent out the software and include full support during the contract period. The main advantages here are low costs and strong scaling. On the other hand, we could offer the product together with hardware for instantaneous use. In this case the hardware would consist of a photobox and a printer. This leads to more customer retention, but probably fewer scaling opportunities. In conclusion: for short-term profits we would probably opt for the second approach (complete package) until the first approach (SaaS) gains enough traction to become highly profitable.

## Data Understanding

### Datasets

Coming to the next phase, it is time to understand the data. We start by looking at the data we will actually need in the process. In our case, data corresponds to images containing faces as well as their corresponding labels. Since our project is split into two parts, we also need two different datasets:

1. A dataset to validate the different face detection models
2. A dataset to train and validate the smile detection models

These two datasets differ somewhat in their structure and content.\
**The first dataset** needs to contain large-scale images of humans where faces are visible, so our face detection can be evaluated.
An example would be a photo of a family or a group of people, as seen in the following image.

![image](./images/group_image.jpg)\
*Credit: [A. Gallagher and T. Chen, 2009](http://chenlab.ece.cornell.edu/people/Andy/ImagesOfGroups.html)*

Ideally, this dataset also comes with labeled bounding boxes for the faces (green). However, that is not always the case and is something we need to consider when choosing our datasets. For example, the ImagesOfGroups dataset does not contain bounding boxes for the faces. Instead, it only contains marks on the eyes, visible as small red areas in the image. During the data preparation phase, we will address and solve this issue.

**The second dataset** is slightly different. Contrary to the first dataset, it does not contain large-scale images of groups of people. Instead, it contains cropped images of single faces. Each of these images should be annotated with a label indicating whether the face is smiling or not. Two examples of smiling and non-smiling images from such a dataset are shown in the following table.

| SMILE | NO-SMILE |
|:------:|:-----:|
| <img src="./images/smile.jpg" width="150" height="150" /> | <img src="./images/no_smile.jpg" width="150" height="150" /> |

*Credit: hromi [GitHub](https://github.com/hromi/SMILEsmileD)*

These images can then be used as positive and negative input to train our smile detection models. As can be seen, some of the images are gray-scale, while others are in RGB color. However, this is not a problem, as we can use preprocessing techniques to convert all images to the same format, which in our case is gray-scale.

Obviously, there are many different facial databases that we can use for our project. The only constraint is that the images must be publicly available and licensed for our use case (see section Data Privacy). By stacking these datasets together, we can use a large amount of data and reduce the risk of overfitting on the peculiarities of a single dataset. This also necessitates a standardization of the labels, which will be addressed in the [Data Preparation](#data-preparation) section.

Specifically, we will use a combination of the datasets described in the following table. To download and combine all data sources into one usable dataset, we created [a bash script](data/build_data_from_source.sh) which executes all necessary commands to build the dataset from source.

| Dataset | Size | Credit | Download |
| ----------- | --- | ----------- | -------- |
| Genki-4k | 4.000 | [UC San Diego](https://inc.ucsd.edu/mplab/398/) | [genki4k.tar](https://inc.ucsd.edu/mplab/398/media/genki4k.tar) |
| SMILEs | 9.476 (neg.) + 3.690 (pos.) | [GitHub user hromi](https://github.com/hromi/SMILEsmileD) | [master.zip](https://github.com/hromi/SMILEsmileD/archive/refs/heads/master.zip) |
| LFWcrop | 603 (neg.) + 600 (pos.) | Labels: [Sanderson, Lovell: Multi-Region Probabilistic Histograms for Robust and Scalable Identity Inference.](https://conradsanderson.id.au/pdfs/sanderson_icb_2009.pdf) <br /> Images: [Huang et al.: Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments.](http://vis-www.cs.umass.edu/lfw/lfw.pdf) | [Labels](https://data.mendeley.com/datasets/yz4v8tb3tp/5)<br />[Images](https://conradsanderson.id.au/lfwcrop/lfwcrop_grey.zip) |
| Own data | 197 | AISS Group 1 (2022) | [here](./data/custom_data/) |
| ImagesOfGroups | 5.075 | [A. Gallagher, T. Chen: Understanding Groups of Images of People](http://chenlab.ece.cornell.edu/people/Andy/ImagesOfGroups.html) | [Images](http://chenlab.ece.cornell.edu/people/Andy/ImagesOfGroups.html) |
| **Total** | **18.369** | | [dataset.zip](https://bwsyncandshare.kit.edu/s/jWLEEBEfp8SHiFm) |

### Data Privacy

**General**

In Germany, data privacy is regulated by the "Datenschutz-Grundverordnung" (DSGVO). For our use case, and for machine learning applications that use or process personal data in general, there are two articles of major interest:

- Article 5 DSGVO: It deals with the processing of personal data in general and how to handle it in a lawful way.
- Article 6 DSGVO: It deals with the legality of the processing of the data with regard to the consent of the person who provided it.

Art. 6 IV DSGVO is of particular interest for machine learning applications. It determines whether further processing of the data is allowed, for example whether the original purpose is still compatible with the result after several processing steps.

**Our Project**

For our project we use five different datasets. Since we do not have any copyright concerns with four of them (either because the data is our own or because the terms of use are clearly stated), we will not discuss them further in this section. We want to take a closer look at the SMILEs dataset, since it is the largest and therefore a very important data source for our project. The dataset is provided on GitHub by a user named "hromi". He makes clear on his page that the dataset is open source, but no references are given. Whether we could use this dataset was heavily discussed in our group. Therefore, we clarified the following points:

- The dataset contains only pictures of prominent people, so the copyright law is only applicable to a limited extent (for example, if they were photographed in a private space) according to §23 I Nr. 1 KunstUrhG.
- We use the data only for non-commercial academic research.
- The pictures are heavily edited: they are downsampled and consist of facial cutouts only.

Furthermore, we think, without any legal expertise, that our usage of the data falls under "Fair Use" according to 17 U.S. Code § 107. It states that the reproduction of copyrighted material for the purpose of criticism, comment, reporting, education, and scholarship does not constitute copyright infringement. Whether or not a use of copyrighted material is fair must be weighed on a case-by-case basis according to the following criteria:

- Purpose and nature of the use
- Nature of the copyrighted work
- Extent and significance of the excerpt used in relation to the entire work
- Impact of the use on the value and exploitation of the protected work

*Credit: [U.S. Copyright Office](https://www.copyright.gov/fair-use/more-info.html)*

However, "Fair Use" is not applicable in Germany, at least not entirely. Most similar to that doctrine are the "Schranken des Urheberrechts". Some paragraphs that could find application in our case are:

- §44a UrhG, covering the temporary use of copyrighted material.
- §§60a-60h UrhG, covering the use of copyrighted material in a scientific environment.

Since we are no legal experts, but considering all the information above, we concluded that we are allowed to make use of the dataset, even though no explicit reference is given for it.

**Copyright concerns with our final product**

To be allowed to use our product, we need the participants' consent to appear on a video stream that is analyzed by our model.
We also need each pictured person's consent to take the photo. Furthermore, to send the picture to a person, we need their consent to process their personal information (for example, their e-mail address).

## Data Preparation

During the data preparation phase, we create our two final datasets. For the face detection evaluation we use the ImagesOfGroups dataset from [here](http://chenlab.ece.cornell.edu/people/Andy/ImagesOfGroups.html). As mentioned in the last section, it does not include face bounding boxes. Instead, we use the marks on the eyes (red) to create bounding boxes as shown in the figure below. This is done by extending the eye marks in a specific way. First, the horizontal distance between both eyes is calculated (light blue). To create the x values of the bounding box, we extend the x values of both eyes outwards by 75% of the horizontal distance (orange). To create the y values of the bounding box, we extend the y values of the eyes by 125% of the horizontal distance upwards and 200% downwards (orange). We can then draw a bounding box around the whole face (green). This method may not be perfect, but it is a good starting point. After a manual review of the bounding boxes, we were satisfied with the results and therefore did not spend more time on improving the bounding box creation.

<img src="./images/bounding_box_creation.png" width="255" height="357" />\
*Credit: [A. Gallagher and T. Chen, 2009](http://chenlab.ece.cornell.edu/people/Andy/ImagesOfGroups.html)*

### Smile Detection

As input for our smile classification model, we use four different datasets, as shown in the last section. Since they are all annotated in different ways, we need to standardize them. Some of them distinguish between smiling and non-smiling faces by placing the corresponding images in different subdirectories; others use a file to match the labels to image file names. We process all of these formats and create one universal CSV file matching images to labels for the combined dataset.

To limit the strain on performance, we tried to simplify the smile detection with appropriate preprocessing. Since most of our training data consists of grayscale images, we decided to use only one input channel. To boost the runtime even further, we tried using only downscaled (32x32 px) images. However, this showed a serious decline in accuracy, so we settled on preprocessing the images to 64x64 (grayscale), which turned out to be a good compromise in the tradeoff between speed and accuracy.

During live demonstrations, we discovered that the model seemed to be sensitive to tilting the head sideways (only to the left side) or rotating the face around its center, resulting in false-positive classifications. Therefore, we tried to produce a more robust model by augmenting our dataset. To achieve this, we added two augmented images for each image present in the dataset: one is mirrored along the vertical axis (left-right flip) to work against a suspected bias towards right-leaning faces, the other is tilted by 15° (alternating between -15° and 15°). This inflated the whole smile classification dataset from 18.369 images to 55.107 images. The function that performs this augmentation (saving to ./data/dataset_augmented) can be found in [data/util.py - augment()](data/util.py). All data that we created and subsequently used for training the smile classifier can be downloaded via bwsyncandshare from [here](https://bwsyncandshare.kit.edu/s/jWLEEBEfp8SHiFm).
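A minimal sketch of this augmentation step is shown below, using Pillow and illustrative file naming; the actual implementation lives in [data/util.py - augment()](data/util.py) and may differ in detail:

```python
import os
from PIL import Image, ImageOps

def augment_image(path, out_dir, angle=15):
    """Write two augmented copies of one face image: a left-right mirror and a
    version rotated by `angle` degrees (the sign alternates in the real pipeline)."""
    img = Image.open(path).convert("L")  # grayscale, matching the training input
    name, ext = os.path.splitext(os.path.basename(path))

    # 1) mirror along the vertical axis to counter the suspected left/right bias
    ImageOps.mirror(img).save(os.path.join(out_dir, f"{name}_flip{ext}"))

    # 2) small rotation around the image center to counter sensitivity to tilted heads
    img.rotate(angle).save(os.path.join(out_dir, f"{name}_rot{ext}"))
```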
The linked directory contains `dataset_smile-classification.tar.gz`, which is used to train the smile classifier, as well as `dataset_face-detection`, which is used to evaluate the face detection algorithms.

## Modeling

This section describes the modeling phase of the project. We divide the modeling phase into face detection and smile detection.

### Face detection

Face detection is the task of detecting faces in images or, as in our case, in a sequence of images (video). It is the necessary first step before we can classify whether a face is smiling or not. Building face detection models from scratch can become a complex task very quickly. The most difficult part is the required data: from a technical point of view, training a model requires images with many different people, backgrounds, poses, and more. Another critical point is data privacy: each person pictured in an image must authorize the use and processing of their personal image. Considering these points, we decided to use two pre-trained models and evaluated both in the end to see which one fits our use case better: on the one hand a classical approach with cascade filters, on the other hand a deep learning approach with the Multi-task Cascaded Convolutional Neural Network (MTCNN). Both models are described in this section.

#### Cascade Filter

This section describes the first of the two face detection models we used: a cascade filter. The idea of Haar cascades emerged before neural networks became widespread. However, due to their simplicity combined with high accuracy, they are still commonly used for face detection tasks. This was our main motivation for using this model, as our deployment device has limited computational power. We therefore hoped for a low inference time while still achieving good detection results on the given images.

A Haar cascade is an object detection algorithm used to identify faces in an image or a real-time video. It uses edge or line detection features by identifying sudden changes of intensity in pixel values, as proposed in a research paper from 2001: *[Rapid Object Detection using a Boosted Cascade of Simple Features](https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-01.pdf)*. A kernel with a specific Haar structure, e.g. a line represented as white pixels on the left and black pixels on the right, is mapped onto an image region, and the difference between the average of the pixels in the light region and the average of the pixels in the dark region is computed. If this difference is close to 1, an edge matching the shape of the kernel is detected.

![image](./images/haar_cascade.png)

*Credit: [Towards Data Science](https://towardsdatascience.com/face-detection-with-haar-cascade-727f68dafd08)*

This kernel needs to be applied multiple times at different sizes, traversing the whole image. Combined with other features, a face can be detected. Since this would require a lot of computations, the *integral image* was introduced: each pixel value is replaced by the sum of all pixel values above and to the left of that pixel. This reduces the number of mathematical operations to four constant-time additions per feature.
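For illustration, OpenCV ships pre-trained Haar cascades; a minimal face detection sketch looks like the following (the parameter values are illustrative and not necessarily our final configuration):

```python
import cv2

# Pre-trained frontal-face cascade shipped with OpenCV
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def detect_faces(frame_bgr):
    """Return face bounding boxes as (x, y, w, h) tuples for one BGR frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # scaleFactor controls the image pyramid, minNeighbors filters weak detections
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```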
#### MTCNN

This section describes the second face detection model we used: a multi-task CNN. This kind of model, built out of three stages of CNNs, has shown good performance in the recent literature (Zhang, K., Zhang, Z., Li, Z., and Qiao, Y. (2016): *Joint face detection and alignment using multitask cascaded convolutional networks*). It is able to detect faces in an image and also provides landmark locations such as eyes, nose and mouth. To this end, it consists of three stages.

The first one, called "Proposal Network" (P-Net), is a fully convolutional network (FCN), i.e. it contains no dense layer, and uses bounding box regression to quickly produce windows that possibly contain a face.

<div align="center">
<img src="./images/mtcnn_pnet.png" width="600"/>
</div>

In a second, more complex CNN stage called "Refine Network" (R-Net), the proposed candidates are refined. Its outputs are whether there is a face in the given box or not, the bounding box as a 4-dimensional vector, and the facial landmarks as a 10-dimensional vector.

<div align="center">
<img src="./images/mtcnn_rnet.png" width="600"/>
</div>

The third stage, called "Output Network" (O-Net), returns more details of the face, including the positions of five facial landmarks.

<div align="center">
<img src="./images/mtcnn_onet.png" width="600"/>
</div>

In theory, this face detection model should benefit from newer technologies like GPU computing via CUDA. Compared to cascade filters, it is a more 'modern' approach to the task.

### Smile detection

#### Approach

To retain flexibility and fine-grained control, we opted to implement our own smile classification model from scratch. We implemented the whole pipeline, from preprocessing over the model architecture to hyperparameter optimization. To leverage the GPU capabilities of the Jetson Nano, we used PyTorch with CUDA support. At training time, we iterate multiple times over the gathered training data in batches to optimize the model (stochastic gradient descent). At inference time, the model's weights are loaded into the GPU and the faces detected by the face detection module are classified. If multiple faces are present, evaluation is again done in batch mode to speed up the process and make use of GPU parallelization.

#### Architecture

In order to keep the architecture simple yet state-of-the-art, we used a custom CNN with two convolutional layers (using ReLU activation) and max pooling.

![image](./images/architecture.png)

*Our architecture (own image)*

The green parts display the sizes of the linear layers and whether dropout is used. They were optimized using a grid search described in the next chapter. We also tried different architectures, such as using 2x2 convolutional filters with stride instead of max pooling, as seen in the class CNN2 in [models.py](data/models.py), which is proposed by [Springenberg et al. (pre-print article)](https://arxiv.org/abs/1412.6806). Furthermore, we experimented with a third convolutional layer. However, no other architecture came close to the performance of the first proposed model in preliminary trials.

#### Training and hyperparameters

To train in a fast and efficient fashion, we deployed the training on the [bwUniCluster](https://wiki.bwhpc.de/e/Category:BwUniCluster_2.0) infrastructure, where GPU power (NVIDIA Tesla V100) accelerated the training. The training time is only around 10 minutes when training for 20 epochs; it increases when the augmented dataset is used (tripling the number of datapoints leads to training times of around 30 minutes). To find the optimal hyperparameters for the model, we parametrized the learning rate, whether to use dropout layers (to potentially combat overfitting), and the sizes of the linear layers.
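A PyTorch sketch of this parametrized architecture is shown below. The channel counts are illustrative assumptions, the input is a 64x64 grayscale face crop as described in [Data Preparation](#data-preparation), and the actual class can be found in [models.py](data/models.py):

```python
import torch.nn as nn

class SmileCNN(nn.Module):
    """Two conv/ReLU/max-pool blocks followed by linear layers; the linear sizes
    and the dropout flag are the hyperparameters tuned in the grid search."""

    def __init__(self, size1=128, size2=64, use_dropout=True):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1 input channel (grayscale)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        layers = [nn.Flatten(), nn.Linear(32 * 16 * 16, size1), nn.ReLU()]
        if use_dropout:
            layers.append(nn.Dropout(0.5))
        layers += [nn.Linear(size1, size2), nn.ReLU(), nn.Linear(size2, 1), nn.Sigmoid()]
        self.classifier = nn.Sequential(*layers)

    def forward(self, x):                                 # x: (batch, 1, 64, 64)
        return self.classifier(self.features(x))
```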
The following image shows one path for each experiment of the grid search, colored by the accuracy on the validation set.

![image](./images/grid-search.png)

The best run was achieved using {Dropout = True, Size 1 = 128, Size 2 = 64, Learning Rate = 0.05}. These hyperparameters resulted in the best accuracy on the validation set, with a value of 92.47%. The model also achieved a solid 95.22% on the training set (note how other models overfitted, especially those without dropout) and 92.4% accuracy on the global holdout/test set. The development of the accuracies during training can be seen in the following picture:

<img src="./images/accuracy-best-run.svg" width="50%" height="50%" />

We found that the dropout layers were beneficial to the model performance. Their influence can be seen when grouping the results by whether dropout was used or not. The following images show all results from the grid search; the line represents the mean over all runs for each class (dropout/no dropout):

Model accuracy | Running loss
:-------------------------:|:-------------------------:
![image](images/overfitting1.png) | ![image](images/overfitting2.png)

While the loss was higher due to the dropped activations, the overfitting (i.e. better performance on the training set combined with worse performance on the test set) was significantly reduced by using dropout.

### Model inspection

As the CNN is a black box regarding how it comes to its final decisions, we applied some techniques from the field of Explainable AI (XAI) to get an understanding of how the model detects smiles. To this end, [saliency.py](smile_detector/evaluation/saliency.py) can be used to generate saliency maps. As described by [Simonyan et al.](https://arxiv.org/pdf/1312.6034.pdf) (p. 4), saliency maps can be generated by classifying one image, computing the gradients via backpropagation, and reshaping the gradients to match the input image. The produced images hint at the areas of the input image in which the gradients were especially significant.

In the following figure, the leftmost image is the original, the second visualizes the saliency map, and the third overlays the saliency map onto the original image.

![image](./images/saliency_teeth_example.png)

It can be seen that the regions around the teeth contain the highest gradients; however, the areas around the nose and eyes also seem to be taken into account. For images of people who smile without showing teeth, the corners of the mouth seem to be the most relevant feature, as can be seen in this graphic:

![image](./images/saliency_noteeth_example.png)

Calculating the saliency for negative images does not make that much sense, as the computed gradients still show which features contributed to the classification "smiling". In the following negative sample, for example, it can be seen that the visible teeth contributed strongly to the prediction score; however, the activation across the network was not sufficient to raise the prediction score above the required 0.5, which would have resulted in a false-positive classification:

![image](./images/saliency_neg_example.png)

We can conclude that the network has successfully captured which areas of the input images are decisive for solving the classification task at hand.
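The core of this saliency computation boils down to a few lines of PyTorch; the following is a simplified sketch (assuming a 64x64 grayscale input tensor), not the exact code in [saliency.py](smile_detector/evaluation/saliency.py):

```python
import torch

def saliency_map(model, image):
    """Saliency map for one preprocessed face image (tensor of shape (1, 64, 64)),
    following Simonyan et al.: backpropagate the prediction score to the input pixels."""
    model.eval()
    x = image.unsqueeze(0).requires_grad_(True)  # add a batch dimension, track input gradients
    model(x).sum().backward()                    # gradient of the smile score w.r.t. the input
    return x.grad.abs().squeeze()                # (64, 64) map: large values mark influential pixels
```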
## Evaluation

We divide the evaluation phase into face detection, smile detection and run-time evaluation. The face detection evaluation deals with the accuracy and performance of the face detector. The smile classifier is evaluated separately from the face detector, focusing on whether it can accurately predict if a cropped-out face is smiling. Finally, the run-time evaluation targets the performance of both processes, as well as the additional overhead created by the remaining features (e.g. displaying the image with UI information).

### Face detection evaluation

For the evaluation of a potential face detector, we use the ImagesOfGroups dataset with the adjustments mentioned in the [data preparation](#data-preparation) chapter. The dataset consists of a total of 5075 samples, with a varying number of faces per image. The following graph depicts the distribution of the number of images per number of faces.

![image](./images/evaluation/num_faces_per_image_distribution.png)

As shown in the image above, over half the samples contain up to five faces per image, with the other half spreading out with decreasing frequency up to 37 faces per image. Each image contains a minimum of two faces. Since our use case focuses on taking group pictures, we believe this dataset has an excellent distribution of classes for the evaluation of the face detector.

#### How detections are evaluated

To describe how well a prediction matches the ground truth, the intersection over union (IoU) is calculated for each prediction and its corresponding ground truth bounding box. As seen in the following image, it represents how well the two bounding boxes match with a score in [0, 1].

![image](./images/evaluation/iou.png)\
*Credit: [Padilla et al., 2020](https://github.com/rafaelpadilla/Object-Detection-Metrics)*

However, the IoU alone does not give a clear decision on whether a detection was a true positive (TP) or a false positive (FP). For this, a threshold in [0, 1] can be defined: detections with an IoU score greater than this threshold are considered a TP. For example, an IoU threshold of 0.5 means that predictions with at least a 50% overlap with the ground truth bounding box are considered a TP, otherwise a FP. Now that true and false positives can be distinguished, precision and recall can be calculated using:

![image](./images/evaluation/precision-and-recall.png)\
*Credit: [Harshit Kumar](https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173)*

Since precision and recall depend on the selected threshold, a range of thresholds is used to determine the average precision. This allows a better comparison between models that would have an equal precision and recall score for a single given IoU threshold. As an example, if the IoU threshold were fixed at 0.5, the following two models would have the same precision and recall:

<table>
<th>Model A</th><th>Model B</th>
<tr><td>

| Detection | IoU | Label |
| - | - | - |
| 1 | 0.9 | True |
| 2 | 0.6 | True |
| 3 | 0.6 | True |

</td><td>

| Detection | IoU | Label |
| - | - | - |
| 1 | 0.9 | True |
| 2 | 0.9 | True |
| 3 | 0.8 | True |

</td></tr>
</table>

But over a range of [0.3, 0.7], for example, Model B would have a higher average precision. From this range of thresholds, we can plot the precisions and recalls of a model throughout the range of thresholds. The area under the precision-recall curve is the model's average precision (AP). The mean average precision (mAP), a common metric for the evaluation of object detection models, is calculated by taking the mean of the average precision over all classes. However, since our face detection models only detect one class, the mean average precision corresponds to the average precision.
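For reference, the IoU used for this matching can be computed as in the following sketch, with boxes given as `(x1, y1, x2, y2)` corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / float(area_a + area_b - intersection)
```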
It is important to note that, because our face detection models do not return predictions with a confidence score, the precision-recall curve depicts how the model performs over the IoU thresholds, with low IoU thresholds yielding the best results for both precision and recall, while high IoU thresholds result in worse values for these metrics. This is due to the fact that predictions which are not matched with any ground truth bounding box are directly marked as false positives, regardless of the current IoU threshold. The same applies to false negatives, as missed ground truth bounding boxes are always counted as false negatives, regardless of the IoU threshold. The number of true positives, on the other hand, is reduced as the IoU threshold increases. With fewer true positives and roughly constant false positives and false negatives, precision and recall decrease with higher IoU thresholds.

#### Cascade filter

One of the main benefits of the cascade filter is the faster inference time on CPU-only devices. The following graphic depicts the precision-recall curve for thresholds in a range of [0.3, 0.7] with a step size of 0.05:

![image](./images/evaluation/cascade_filter_precision_recall_curve.png)

As seen in the curve, the recall is high, while the precision is lower. The average precision of the cascade filter lies around 77.8%.

#### MTCNN

The MTCNN benefits from a more modern architecture and achieves a much higher precision, while sacrificing some recall. The main drawback of the MTCNN is the slower inference time on CPU. To combat this, the MTCNN model parameters can be fine-tuned to trade off some detection quality for faster inference times. One of the most beneficial parameters is `min_face_size`: higher values improve performance, as the model avoids searching for small faces. The following graphic shows the precision-recall curves for various `min_face_size` values and thresholds in a range of [0.3, 0.7] with a step size of 0.05.

![image](./images/evaluation/mtcnn_precision_recall_curves.png)

As seen in the image, all variations of the MTCNN model perform very well in terms of precision and differ mostly in their recall score and average frames per second.

#### Comparison of results

Here are the precision-recall curves of both models, including the variations of the MTCNN model with multiple `min_face_size` values.

![image](./images/evaluation/precion_recall_all_face_detectors.png)

Even though the cascade filter manages to score a better recall overall, its precision is significantly below that of the MTCNN, leading to a lower average precision. In addition, even though the cascade filter runs faster on a CPU, the GPU-accelerated MTCNN outperforms the cascade filter both in terms of average precision and speed.

| Model | Average Precision | Average FPS - Macbook Pro 2021 (CPU) | Average FPS - Jetson Nano (GPU) |
| - | - | - | - |
| Cascade Filter | 77.8% | 20 | 1.9 |
| MTCNN-20_minface | 88.6% | 10 | 2.5 |
| MTCNN-30_minface | 88.4% | 18 | 4.6 |
| MTCNN-40_minface | 83.8% | 24 | 6.2 |

In terms of average precision, the MTCNN with a `min_face_size` of 20 achieves the highest value of 88.6%. However, the MTCNN with a `min_face_size` of 30 only slightly underperforms the MTCNN-20 while achieving an 80% increase in FPS. For this reason, we chose the MTCNN with a `min_face_size` of 30 as our final face detector.
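Assuming the widely used facenet-pytorch implementation of the MTCNN (the concrete library is not specified in this document), the chosen configuration could be instantiated roughly as follows:

```python
import torch
from facenet_pytorch import MTCNN

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# min_face_size=30 trades a little average precision for a large FPS gain (see table above)
detector = MTCNN(min_face_size=30, keep_all=True, device=device)

# For one RGB frame (PIL.Image or numpy array):
# boxes, _ = detector.detect(frame)   # boxes: one (x1, y1, x2, y2) array per detected face
```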
### Smile classifier evaluation

This section describes the evaluation of the final smile classifier after the hyperparameter tuning mentioned in [Modeling](#modeling). The test dataset used for the evaluation contains a total of 5511 samples (from the augmented dataset). The following picture shows the precision-recall curve over a threshold range of [0, 1]. A prediction is considered a smile when the output of the model is greater than the given threshold.

![image](./images/evaluation/smile_classifier_eval_augmented_dataset.png)

The average precision (area under the curve) turned out to be around 96%, with the optimal combination of precision and recall at a threshold of roughly 0.5, which corresponds to the threshold used during training. At this threshold, the model scores on both the base and the augmented dataset are summarized as follows:

| Dataset | Precision | Recall | F1 Score | Accuracy |
| - | - | - | - | - |
| Base dataset (1837 samples) | 95.4% | 90.6% | 93.0% | 95.0% |
| Augmented dataset (5511 samples) | 90.8% | 86.0% | 88.4% | 91.8% |

As can be seen in the results, the classifier performed very well on the base dataset and achieves a lower, yet still decent, score on the more difficult augmented dataset.

### Run-time evaluation

For the run-time evaluation of face detector and smile classifier running together, the ImagesOfGroups dataset is used once again, as it contains samples that neither the face detector nor the classifier has seen during training. It also represents, more or less, the type of images that will be encountered in production use of this application. The face detector used was the MTCNN with a `min_face_size` of 30x30 pixels. The smile classifier weights were loaded from CNN_dropout-True_lr-0-005_l_size-128-64.pth, which can be found under [models](smile_detector/smile_detection/trained_models). The average frames per second were measured from just before loading an image into memory until right before the image is displayed, over a total of 5075 samples from the dataset. The results are summarized as follows:

| Hardware used | Average FPS |
| - | - |
| Macbook Pro 2021 (CPU) | 17.5 |
| Jetson Nano Developer Kit 2022 (GPU) | 1.8 |

Due to the powerful CPU of the Macbook, the application reaches a performance of 17.5 FPS there. However, since the application is meant to be deployed on mobile devices (e.g. a phone), the focus is on the 1.8 average FPS of the Jetson Nano. This evaluation was done on a dataset with high-resolution images, which do not correspond to the images captured by the Jetson Nano's camera; in real-world use, the average FPS should therefore be higher.

As seen in the difference between the combined performance and the individual tests, the face detector appears to be the main performance bottleneck. Therefore, we settled on keeping a cached list of detected face regions and skipping face detection for the subsequent n frames, as we assume that faces will only move slightly while people prepare for a picture. The smile classifier still processes every frame, as the change between smiling and not smiling occurs significantly faster.
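In simplified Python, this frame-skipping loop looks roughly like the following sketch (the `detect_faces` and `classify_smiles` callables are illustrative placeholders for the two models):

```python
SKIP_FRAMES = 1  # number of consecutive frames that reuse the cached face boxes

def run_loop(frames, detect_faces, classify_smiles):
    """Run face detection only on every (SKIP_FRAMES + 1)-th frame, while the
    smile classifier runs on every frame using the cached bounding boxes."""
    cached_boxes = []
    for i, frame in enumerate(frames):
        if i % (SKIP_FRAMES + 1) == 0 or len(cached_boxes) == 0:
            cached_boxes = detect_faces(frame)          # expensive step, skipped on most frames
        smiles = classify_smiles(frame, cached_boxes)   # cheap, evaluated in one batch per frame
        yield frame, cached_boxes, smiles
```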
The images in the ImagesOfGroups dataset are mostly independent of each other and would therefore not be an accurate representation of the real-world performance with this frame skipping enabled. The following summary of the performance with frame skipping is therefore based on our observations during live demos (not calculated):

| Hardware used | FPS | FPS (1 frame skip) | FPS (2 frame skip) |
| - | - | - | - |
| Jetson Nano Developer Kit 2022 (GPU) | ~6.5 | ~10.5 | ~17.5 |

For the final demo, based on our observations, we decided on skipping one frame as the best combination of performance and accuracy.

## Deployment

This section describes the deployment phase of the project, in which we shortly explain how we deployed our models. To allow a safe deployment on the device, the required models have to function properly. Since we used standard libraries and packages for implementing both the face detector and the smile classifier, they can be installed on any device with an internet connection. Our code is available on GitLab. However, due to some dependency restrictions, it is important that the exact same versions of the packages are installed. To ease the deployment, we listed all of them with their required version. To monitor the performance and fix upcoming incompatibilities, we ran our application on different devices, which should make it more robust when deployed on new devices.