---
tags: machine learning
---

# ML-02-LeibetsederRiegerRoitherSeiberl

## Face recognition assignment

### Task Definition

Implement and decide on the best machine learning algorithm and model. Train different models and showcase the approach: do comparisons, understand the data, and relate the results to the given parameters of the implemented machine learning processes.

### Data Understanding

The dataset was given as a batch of 50x50 pixel *".png"* files. The pictures are in greyscale and show 30 people with different facial expressions, photographed from different angles, as can be seen in the examples below. Each of the 30 persons has 20 pictures taken from different angles, with the exception of the persons with the IDs 23, 18, 24 and 5. The faces in the pictures are centered, which means we don't have to do advanced preprocessing on the images.

| ![](https://i.imgur.com/ukHAXH1.png) | ![](https://i.imgur.com/wNqiuFh.png) | ![](https://i.imgur.com/NBCXnas.png) | ![](https://i.imgur.com/4nIgtjx.png) | ![](https://i.imgur.com/IYaMC7x.png) |
|---|---|---|---|---|
| ![](https://i.imgur.com/s1rS3Sq.png) | ![](https://i.imgur.com/Eku1wv1.png) | ![](https://i.imgur.com/ibe7d83.png) | ![](https://i.imgur.com/DU4FuTG.png) | ![](https://i.imgur.com/YMrrkfB.png) |
| ![](https://i.imgur.com/dMwwMgJ.png) | ![](https://i.imgur.com/EfrRjbq.png) | ![](https://i.imgur.com/FQglwxp.png) | ![](https://i.imgur.com/kKZ8siB.png) | ![](https://i.imgur.com/vfjYUta.png) |
| ![](https://i.imgur.com/Hd9r9mz.png) | ![](https://i.imgur.com/RdvPjnC.png) | ![](https://i.imgur.com/WJ7avcm.png) | ![](https://i.imgur.com/hRumhSa.png) | |

Using a pandas *DataFrame* we can extract the number of samples and the number of features:

```python
print(Colors.positive + str(df_faces.shape[0]) + " Samples and "
      + str(df_faces.shape[1]) + " Features")
# [+] 574 Samples and 2501 Features
```

### Data Splitting

The group decided to use 75% of the samples as training data and 25% as test data.

```python
data_train, data_test, target_train, target_test = train_test_split(
    df_faces.iloc[:, 1:],  # pixel columns
    df_faces.iloc[:, 0],   # person ID
    train_size=0.75,
    random_state=123456)

# [*] Splitting into training and test set...
# [*] 430 Samples training data set
# [*] 144 Samples test data set
```

### Variance Extraction

As we are dealing with 2501 features, and therefore just as many dimensions, we need to identify the meaningful features of our dataset. *PCA* can be used to compute the explained-variance ratio and read the variance of the features from the plot. In the figure below we can see that the cumulative variance stops increasing at around feature #100.

![](https://i.imgur.com/W83RHeK.png)

```python
# number of PCA components: the minimum of sample count and feature count
min_amount = min(df_faces.shape[0], df_faces.shape[1])
variance = pca_variance(df_faces, min_amount)

# first index at which the cumulative variance reaches its maximum,
# i.e. the point where the variance stops increasing
variance_features = np.amin(np.where(variance == np.amax(variance)))  # = 104
```
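The `pca_variance` helper used above is not shown in the report. A minimal sketch of what it could look like, assuming it fits a *PCA* and returns the cumulative explained-variance ratio, rounded so that the curve becomes exactly flat once additional components stop contributing (which is what the `np.where(variance == np.amax(variance))` lookup relies on):

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_variance(df, n_components):
    """Fit a PCA on the pixel columns and return the cumulative
    explained-variance ratio per number of components."""
    pca = PCA(n_components=n_components)
    pca.fit(df.iloc[:, 1:])          # column 0 holds the person ID
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return np.round(cumulative, 3)   # rounding is an assumption
```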
### Correlation Extraction

We now know that we can use 104 features to distinguish the people in our dataset. Whether we plot all 2501 features or only the 104 meaningful ones makes a big difference, and it also changes the outcome of the correlation plots, as seen in the figures below.

![](https://i.imgur.com/qqOohlJ.png)

In the picture above, all 2501 features are plotted. Interpreting this plot, one can tell that there are high correlations between the features from 0 to about 500 on the x and y axes, and a much lighter correlation from around feature 800 until the end of the feature set.

![](https://i.imgur.com/MdqHztL.png)

Looking only at the features extracted via the *PCA* analysis (104 features), the correlations become much clearer and easier to read. In the plot above, the features above 45 correlate with each other, and the same happens for the features below 45 on each axis. Using only the 104 best (high-variance) features, we reduce the dimensionality, and with it the computation time of the machine learning algorithms, up front.

```python
data_filtered = SelectKBest(chi2, k=variance_features) \
    .fit_transform(data_train, target_train)
df_data_filtered = pd.DataFrame.from_records(data_filtered)
print(Colors.positive + "Filtered Data: " + str(df_data_filtered.shape))
show_correlation_plot(df_data_filtered.corr(), df_data_filtered)
```

##### Take some of the original pixel features of your data that lie near to each other (e.g. first 100) and do a correlation plot for them. What do you see?

We can tell that features that lie near to each other correlate, because neighboring pixels of a greyscale face image hardly differ in their values.

### Eigenfaces

The eigenfaces after applying *PCA* with 104 components can still be distinguished, although some faces might be hard to make out.

![](https://i.imgur.com/w50FrlO.png)

By doing *PCA* with 104 components instead of the 2500 pixel features, the complexity of the machine learning input is reduced by roughly 96% [`(2500-104)*100/2500 ≈ 95.8`].

Applying the *PCA* transformation to the data reduces this complexity (note that the PCA is fitted on the training data only):

```python
# fit the PCA on the training data, then project both splits
pca = PCA(n_components=variance_features)
pca.fit(data_train)
data_train = pca.transform(data_train)
data_test = pca.transform(data_test)
```

### Models

The group decided to go with *SVC*, *Decision Tree*, *KNN*, *Random Forest*, *Neural Networks*, *Logistic Regression*, *Naive Bayes* and *LDA*, testing our data with several different models to get a good overview.

#### K-Nearest Neighbors [KNN] ([Source](https://hagenberg.elearning.fh-ooe.at/pluginfile.php/358620/mod_resource/content/10/MCM_ML_04_regression_classification.pdf))

*KNN* is one of the simplest classification algorithms. It is non-linear, classifies a sample based on its **K** nearest neighbors, and can be used with different distance measurements, like the Euclidean or Manhattan distance (see the sketch below).
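The distance measure is simply a constructor argument in scikit-learn. A small illustrative sketch (not taken from the assignment code) of the two metrics mentioned above:

```python
from sklearn.neighbors import KNeighborsClassifier

# Euclidean distance: Minkowski metric with p=2 (scikit-learn's default)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, p=2)
# Manhattan distance: Minkowski metric with p=1
knn_manhattan = KNeighborsClassifier(n_neighbors=5, p=1)

knn_euclidean.fit(data_train, target_train)
print(knn_euclidean.score(data_test, target_test))  # accuracy on the test set
```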
#### Support Vector Classification [SVC] ([Source](https://hagenberg.elearning.fh-ooe.at/pluginfile.php/358637/mod_resource/content/7/MCM_ML_10_svm.pdf))

*SVC* is a sub-class of *Support Vector Machines* and was developed, as the name already states, for classification. Using different kernels and parameters, it is possible to learn and try different approaches with an *SVM*. In our case we are dealing with faces of different persons (classes), so this could lead to a good outcome.

#### Decision Tree [CART] ([Source](https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052))

A *Decision Tree* can be used for classification and regression. The tree is built from decisions: a sample moves down through the nodes, each testing a feature, until it reaches a leaf holding the final decision/class.

#### Random Forest [RF] ([Source](https://www.analyticsvidhya.com/blog/2014/06/introduction-random-forest-simplified/))

A *Random Forest* is like a *Decision Tree* (CART), but instead of following a single tree it builds many randomized trees and combines their results.

#### Linear Discriminant Analysis [LDA] ([Source](https://machinelearningmastery.com/linear-discriminant-analysis-for-machine-learning/))

*LDA* is a linear classification technique for problems where more than two classes are involved. It can get unstable with too few examples and consists of statistical properties that are estimated from the data and put into the *LDA* equation used for the classification.

#### Neural Networks [NN] ([Source](https://www.codementor.io/james_aka_yale/a-gentle-introduction-to-neural-networks-for-machine-learning-hkijvz7lp))

Neural networks consist of perceptrons, each connected to the next layer in the network. A perceptron is fed with inputs, and an activation function decides whether it fires or not.

#### Logistic Regression [LR] ([Source](https://machinelearningmastery.com/logistic-regression-for-machine-learning/))

Logistic regression uses an equation as its representation, where input values are combined linearly using weights or coefficient values to predict an output value.

#### Gaussian Naive Bayes [NB] ([Source](https://machinelearningmastery.com/naive-bayes-for-machine-learning/))

Naive Bayes is a classification algorithm for binary and multi-class classification problems. When a naive Bayes model is learned, a list of probabilities is stored.

### Model Parameters

```python
params = {
    'KNN': {'n_neighbors': [1, 3, 5, 7, 10, 15]},
    'SVC': [
        {'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 1, 10]},
        {'kernel': ['rbf'], 'C': [0.001, 0.01, 0.1, 1, 10],
         'gamma': [0.001, 0.01, 0.1, 1]}
    ],
    'CART': {'max_depth': [1, 5, 10, 20]},
    'RF': {'max_depth': [1, 5, 10, 20], 'n_estimators': [1, 3, 5, 10]},
    'NN': {'hidden_layer_sizes': [(1024,)], 'batch_size': ['auto'],
           'early_stopping': [True]},
    'LDA': {'solver': ['lsqr']},
    'NB': {},
    'LR': {'solver': ['liblinear'], 'multi_class': ['ovr']}
}
```

We used the above parameters to train our models on the training data, using:

- different numbers of neighbors for KNN,
- different kernels and values for **C** for SVC,
- different maximum depths for the Decision Tree,
- the same maximum depths for the Random Forest, combined with different numbers of estimators,
- one type of NN with early stopping,
- LDA with the _lsqr_ solver,
- the multi-class and solver parameters for the Logistic Regression,
- and Naive Bayes without any specification.

### Model Fitting

After deciding on the parameters for the algorithms, a _GridSearchCV_ was used to fit the models with the predefined parameters. The code can be found in the corresponding .py-file, as this training is a standard procedure anyway.

### Prediction and Evaluation

#### Predicting

We decided to use an accuracy score on the predicted classes, which resulted in the following accuracy table for each trained model:

| | KNN | SVC | CART | RF | LDA | NN | LR | NB |
|-----------|--------|--------|---------|---------|---------|--------|------|------|
| Accuracy | 0.986 | 1.000 | 0.771 | 0.882 | 1.000 | 0.979 | 0.993 | 0.986 |

Comparing this with the cross-validation accuracy below, we can see that the LDA and SVC accuracy drops when the test data is used for validation. Interpreting these results, we therefore think that LDA and SVC are "overfitted": the models don't generalize well enough to distinguish the faces and therefore produce false positives. ([Source](https://stats.stackexchange.com/questions/29385/collinear-variables-in-multiclass-lda-training))
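The evaluation code lives in the accompanying .py-file; a minimal sketch of how the values in this and the following tables could be computed, assuming a hypothetical dict `models` that maps the model names to the fitted _GridSearchCV_ objects:

```python
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             mean_squared_error)

# `models` is assumed to map names like 'KNN' to fitted GridSearchCV objects
for name, model in models.items():
    prediction = model.predict(data_test)
    print(name,
          "Accuracy:", round(accuracy_score(target_test, prediction), 3),
          "MAE:", round(mean_absolute_error(target_test, prediction), 3),
          "MSE:", round(mean_squared_error(target_test, prediction), 3))
```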
#### Mean Absolute Error and Mean Squared Error

The Mean Absolute Error ([Source](https://medium.com/@ewuramaminka/mean-absolute-error-mae-machine-learning-ml-b9b4afc63077)) is often used to rate and evaluate machine learning regression models. Treating the predicted class IDs as numeric values, it tells us how far off our models are:

| | KNN | SVC | CART | RF | LDA | NN | LR | NB |
|-----------|--------|--------|---------|---------|---------|--------|------|------|
| MAE | 0.229 | 0.000 | 3.021 | 1.722 | 0.000 | 0.195 | 0.070 | 0.111 |

Averaged over all models, the MAE is about 0.667; considering only our working (not overfitted) models, it is about 0.889.

The Mean Squared Error ([Source](https://www.bmc.com/blogs/mean-squared-error-r2-and-variance-in-regression-analysis/)) is the average of the squares of the errors; the larger the number, the larger the error. Error in this case means the difference between the observed values and the predicted ones:

| | KNN | SVC | CART | RF | LDA | NN | LR | NB |
|-----------|--------|--------|---------|---------|---------|--------|------|------|
| MSE | 4.784 | 0.000 | 54.924 | 54.924 | 0.000 | 1.958 | 0.695 | 0.889 |

Based on the table, we have an average MSE of 14.772, and considering only our working models about 19.696. We are aware that these metrics are of limited use for classifiers such as the NN, since the numeric distance between two class IDs carries no real meaning.

#### Cross-Validation

To improve upon our findings we used cross-validation for our predictions. First on the training data, where one can see that *NN*, *LR*, *KNN* and *NB* come pretty close to 99% accuracy.

![](https://i.imgur.com/EzTjTVy.png)

More interesting are the validations on the test data, as the algorithms have never seen this data.

![](https://i.imgur.com/EvEoRzQ.png)

Here the differences can be seen very well: looking at *LDA*, the model clearly overfitted during training, and its test results are far worse than its training results. *SVC* also performs worse, but still reaches a good rate between 80-100%. The two models that work best on the test data are the Neural Network and the Logistic Regression model. Both perform well, with *LR* having the better accuracy. One can also see from the boxplot that both *CART* and *NB* have a huge variance in their results and do much worse on the test data, whereas *NN* and *LR* also shine in terms of variance.

***

### What is the best approach (features, model, ...) you can come up with to successfully distinguish people? What do you think of the data/the results? Choose appropriate metrics to underline your statements

Referring to the results above, especially those of *NN* and *LR*, we would consider using one of these two. We lean towards *LR*, as it is less complex and more reliable when looking at the *MSE* and *MAE* results and the accuracy on our test data.

### Further Questions

- #### What happens if we build our model from such a data set, then somebody not part of it uses a system where the model is deployed?

  That person would not be recognized, or would be wrongly recognized as one of the known identities, since he or she was not part of the training data. The models only work for people who are part of the dataset provided beforehand.

- #### What could you do about it/how could such systems possibly work?

  The model could learn new faces, provided the new face images are preprocessed (and PCA-transformed) in the same way as the training data. For the feed-forward neural network, incremental training with the new images could be used to increase the accuracy, as sketched below.
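A minimal sketch of such incremental training, assuming scikit-learn's `MLPClassifier`, that the known persons have the IDs 0-29, and hypothetical variables `new_faces` (the PCA-transformed images of the new person) and `new_id` for its label:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# all class IDs, including the new person's, must be declared
# on the very first partial_fit call
clf = MLPClassifier(hidden_layer_sizes=(1024,))
clf.partial_fit(data_train, target_train, classes=np.arange(31))

# later: incremental update with the preprocessed images of the new person
new_id = 30  # hypothetical label for the new person
clf.partial_fit(new_faces, np.full(len(new_faces), new_id))
```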