# ML-03-LeibetsederRiegerRoitherSeiberl
## Assignment 3 - Gesture Recognition
### Task definition
Implement several machine learning algorithms, train different models and decide on the best one. Showcase the approach: compare the models, understand the data, and relate the results to the parameters chosen in the implemented machine learning process.

### Data Understanding
To work with the data, a minimal understanding of the dataset is needed, so examining it with plots and other analysis tools is a good starting point. The analysis should cover all files. As this assignment deals with gestures recorded on an Android Wear device, plotting the data quickly shows that the dataset consists of accelerometer values along the x-, y- and z-axis.
The dataset is distributed over three .csv files (raw_data_wear_x.csv, raw_data_wear_y.csv, raw_data_wear_z.csv). Each file contains the acceleration values of one axis.
The columns are as follows:
1. gesture
2. participantNr
3. sampleNr
4. N acceleration values (different lengths per sample)
**One sample e.g.:**
**X-Axis:** left,0,0,-3.8217444,-3.7738605,-4.1235633,-4.0702925,-4.0153756, ...
**Y-Axis:** left,0,0,-3.1598973,-3.1188967,-3.458574,-3.3167176,-3.2155626, ...
**Z-Axis:** left,0,0,1.9331683,1.9261353,1.925836,2.0039468,1.9434932, ...
**Same sample visualized in a 3D plot:**
**Figure 1** visualizes the acceleration values of the x-, y- and z-axis of a sample. It shows the movement of an arm going from right to left. As seen in the figure, there is quite some noise in the sample, which is not desired.
The sample was interpolated up to 200 values, scaled and finally filtered. The [Savitzky-Golay-Filter](https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter) was used with a window length of 69 values and a polyorder of 2. The noise was reduced significantly, as seen in **Figure 2** and **Figure 4**.
| Not preprocessed | Preprocessed |
|------------------|--------------|
| Figure 1: Visualizing the acceleration values of the x-, y- and z-axis of sample #0 in a 3D plot. | Figure 2: Visualizing the interpolated, scaled and filtered acceleration values of the x-, y- and z-axis of sample #0 in a 3D plot. |
| Figure 3: Visualizing the acceleration values of the x-, y- and z-axis of sample #6 in a 3D plot. | Figure 4: Visualizing the interpolated, scaled and filtered acceleration values of the x-, y- and z-axis of sample #6 in a 3D plot. |
### Data analysis
The first step was to analyse the balance of the given dataset. The 2160 samples are equally distributed, so every movement has 270 samples.
<table>
<tr><th>Movement Samples </th><th>Participants</th></tr>
<tr><td>
| Movement | Nr of samples |
|-----------|---------------|
| right | 270 |
| left | 270 |
| up | 270 |
| down | 270 |
| triangle | 270 |
| square | 270 |
| circleCw | 270 |
| circleCcw | 270 |
</td><td>
| Participant Nr. | Nr of samples |
|-----------------|---------------|
| 0 | 240 |
| 1 | 240 |
| 2 | 240 |
| 3 | 240 |
| 4 | 240 |
| 5 | 240 |
| 6 | 240 |
| 7 | 240 |
| 8 | 240 |
</td></tr> </table>
> Finding: The given dataset is **balanced**.
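A minimal sketch of how such a balance check could look, assuming the x-axis file is loaded with pandas; the column names `gesture`, `participantNr` and `sampleNr` follow the column description above, while the padding of the ragged rows with a generous column upper bound is an assumption for illustration:
```python
import pandas as pd

# assumption: the raw CSV has no header row and ragged row lengths, so a
# generous upper bound of columns is given and shorter rows are padded with NaN
cols = ['gesture', 'participantNr', 'sampleNr'] + list(range(1000))
df_x = pd.read_csv('raw_data_wear_x.csv', header=None, names=cols)

# count samples per gesture and per participant
print(df_x['gesture'].value_counts())        # expected: 270 per movement
print(df_x['participantNr'].value_counts())  # expected: 240 per participant
```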
#### Data Preprocessing
Preprocessing prepares the data for further processing, be it additional calculations and feature extraction or an immediate train-test split. Different types of preprocessing steps can be applied, e.g. normalization or interpolation. In the case of the accelerometer values, the group decided to interpolate, scale and filter the data, in that order.
##### Interpolation
For interpolation, we tried three approaches: in the first one we interpolated with the NaN values kept, in the second one the NaN values were replaced with 0, and in the third one the NaN values were removed. In the end, we decided to go with the third version, as it gave better results after the cross-validation step.
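A minimal sketch of the chosen variant (NaN values removed, then the sample resampled to 200 points), assuming one row of acceleration values is given as an array; the helper name `interpolate_sample` is chosen here for illustration:
```python
import numpy as np

def interpolate_sample(values, target_len=200):
    """Drop NaNs and linearly interpolate a sample to a fixed length."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]          # third variant: remove NaNs
    old_x = np.linspace(0, 1, num=len(values))  # original sample positions
    new_x = np.linspace(0, 1, num=target_len)   # 200 evenly spaced positions
    return np.interp(new_x, old_x, values)
```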
##### Scaling
After interpolation, the data was scaled to normalize the range of the independent variables, i.e. our features.
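The report does not name the exact scaler, so as an illustration, one way to scale the interpolated samples with scikit-learn's MinMaxScaler (an assumption, not necessarily the scaler used; `samples_interpolated` is assumed to be a 2D array with one sample per row):
```python
from sklearn.preprocessing import MinMaxScaler

# scale every column of the interpolated samples to the range [0, 1];
# the concrete scaler is an assumption for illustration purposes
scaler = MinMaxScaler()
samples_scaled = scaler.fit_transform(samples_interpolated)
```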
##### Filtering
For filtering, we decided to go with the *Savitzky-Golay-Filter* to reduce the noise in the data. We also tried the Median Filter, but it did not fit the requirements of the assignment as well.
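A minimal sketch of the filtering step with the parameters stated earlier (window length 69, polyorder 2), assuming `samples_scaled` is a 2D array with one interpolated, scaled sample per row:
```python
from scipy.signal import savgol_filter

# apply the Savitzky-Golay filter along each sample (row-wise)
# with the window length and polynomial order used in this report
samples_filtered = savgol_filter(samples_scaled,
                                 window_length=69, polyorder=2, axis=1)
```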
### Feature Extraction
Feature selection/extraction means deciding on the features the machine learning algorithm should be trained on. *PCA* is often used to determine the meaningful features and can be tuned by choosing how many features to keep from the given dataset(s). In the case of the accelerometer values from the gestures, the group decided to try two different approaches: one with autocorrelation before *PCA* and one without autocorrelation, which moves on directly to *PCA*.
#### Autocorrelation
With autocorrelation it is possible to estimate the signal-to-noise ratio and to check whether the data correlates with a lagged version of itself. In the two pictures below we can see that the values in the range from 0 to 600 are more correlated than the values from 1000 to 2000. Autocorrelation also shows how much impact past values have on current values.
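A minimal sketch of how the autocorrelation of a signal could be computed with numpy; this is a generic estimate for illustration, not necessarily the exact implementation used:
```python
import numpy as np

def autocorrelation(signal):
    """Normalized autocorrelation of a 1D signal for all non-negative lags."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()              # remove the mean first
    full = np.correlate(signal, signal, mode='full')
    acf = full[full.size // 2:]                  # keep non-negative lags only
    return acf / acf[0]                          # normalize so lag 0 equals 1
```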
#### Variance extraction
As we are dealing with approx. 600 features, we need to identify the meaningful features of our dataset. Therefore we used *PCA* to get the explained variance ratio and read the variance of the features from the plot. In the figure below we can see that the features stop varying at around feature #30.

```python=
# imports needed for this snippet
import numpy as np
from sklearn.decomposition import PCA

# keep only numeric features (drops the gesture label column)
data = data._get_numeric_data()
x = data  # feature matrix used to fit the PCA
# PCA can extract at most min(n_samples, n_features) components
min_amount = min(data.shape[0], data.shape[1])
covar_matrix = PCA(n_components=min_amount)
covar_matrix.fit(x)
# calculate the explained variance ratio per component
variance = covar_matrix.explained_variance_ratio_
# var - cumulative sum of variance explained with [n] features (in percent)
var = np.cumsum(np.round(variance, decimals=3) * 100)
# index of the first component where the cumulative variance stops growing
variance_num = np.amin(np.where(var == np.amax(var))) + 1
# number of features to keep: either that index or the upper bound
variance_features = min(variance_num, min_amount)
```
With *PCA* we found out that our data can be represented with 28 features.
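A sketch of how the reduced feature set could then be obtained, assuming the train-test split described further below has already been made and the PCA is fitted on the training data only (variable names are illustrative):
```python
from sklearn.decomposition import PCA

# fit the PCA on the training data only, then project both splits
pca = PCA(n_components=28)
data_train_pca = pca.fit_transform(data_train)
data_test_pca = pca.transform(data_test)
```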
#### Correlation Extraction
| | |
|---|---|
| Correlation plot of all features | Correlation plot of all features + autocorrelation |
| Correlation plot after PCA for train data | Correlation plot after PCA for test data |
As we can see in the pictures, there is a large amount of correlation when all features are used. Adding the autocorrelation to the data reduces the correlation a little, but not by much. Looking at the third picture, with only the 28 selected features the strong correlation is nearly removed for the training data and completely removed for the test data.
Using only the 28 best (high-variance) features, we can reduce the dimensionality and the computation time for the machine learning algorithm up front.
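A minimal sketch of how correlation plots like the ones above could be produced, assuming `features` is a pandas DataFrame (either all features or the 28 PCA components):
```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_correlation(features: pd.DataFrame, title: str):
    """Plot the pairwise Pearson correlation matrix of a feature DataFrame."""
    corr = features.corr()
    plt.figure(figsize=(8, 8))
    plt.imshow(corr.values, cmap='coolwarm', vmin=-1, vmax=1)
    plt.colorbar(label='correlation')
    plt.title(title)
    plt.show()
```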
### Data Partitioning (Train-Test-Split)
We decided to split the data into data_train, data_test, target_train and target_test. The train sets contain 75 % of the data and the test sets the remaining 25 %. Before the split we applied the [Savitzky-Golay-Filter](https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter) to reduce the noise in the data. We also tried the [Median](https://en.wikipedia.org/wiki/Median_filter) filter, but it did not fit the requirements of the assignment as well.
```python=
from sklearn.model_selection import train_test_split

# le is the LabelEncoder fitted on the gesture labels beforehand
data_train, data_test, target_train, target_test = train_test_split(
    df_savgol.iloc[:, 3:],               # acceleration values only
    le.transform(df_savgol.iloc[:, 0]),  # encoded gesture labels
    train_size=0.75, random_state=123456)
```
### Model
#### Model Tuning and Selection
In this step, machine learning algorithms are trained for later evaluation. Different types of machine learning algorithms are used and their hyperparameters are tweaked here.
#### Model Algorithm Selection
The selection step should be based on the accuracy and the complexity of the trained model. If a question can only result in the answers A or B, a less complex *CART* with an accuracy of 0.95 can be used. A neural network with an accuracy of 0.98 can be used as well, but it is far more complex than a *CART*. One always needs to ask whether it is worth going for the best model trained, or whether a compromise with the less complex one is acceptable. As a *CART* still performs very well, its model may be deployable to many more devices than the more complex neural network. It really depends on the domain and use case a machine learning approach is introduced to.
This is where the algorithms used to train a machine learning model are selected. The group started with a selection of *KNN*, *SVC*, *CART*, *RF*, *NN*, *LDA*, *LR* and *NB* to get an overview of the performance of different machine learning algorithms.
#### Hyperparameter Tuning
Tuning the hyperparameters of a machine learning algorithm changes the outcome of the prediction. E.g. by using 1024 perceptrons and 1000 training iterations instead of 200 iterations, the neural network's performance improved from 0.8814 to 0.9667. This is just an example of how the performance can be tweaked by changing the hyperparameters and doesn't relate to the outcome of this exercise, even though we achieved this value without doing any feature selection beforehand. The parameters used now result in the best accuracy for each model; other parameters that were tried out in the process are commented out.
```python
params = {
    'KNN': {'n_neighbors': [1, 3, 5, 7, 10, 15]},
    'SVC': [
        {'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 1, 10]},
        {'kernel': ['rbf'], 'C': [0.001, 0.01, 0.1, 1, 10],
         'gamma': [0.001, 1, 0.1, 1]}
    ],
    'CART': {
        'max_features': ['auto'],  # 'max_features': ['auto', 'log2'],
        'min_samples_leaf': [1],   # 'min_samples_leaf': [1, 5, 10],
        'max_depth': [25],         # 'max_depth': [1, 5, 10, 20, 25, 50],
    },
    'RF': {
        'max_features': ['log2'],  # 'auto',
        'max_depth': [25],
        'n_estimators': [1000]
    },
    'NN': {
        'learning_rate': ['constant', 'invscaling', 'adaptive'],
        'hidden_layer_sizes': [(1024,)],  # , (10,), (1024, 512, 16),
                                          # (13, 13, 13), (20, 14, 8)
        'shuffle': [True],
        'activation': ['relu'],    # 'identity', 'logistic', 'tanh',
        'random_state': [True],
        'batch_size': ['auto'],
        'early_stopping': [True],
        'max_iter': [1000]         # , 500, 200, 100
    },
    'LDA': {'solver': ['lsqr', 'svd']},
    'LR': {'solver': ['liblinear'], 'multi_class': ['ovr']},
    'NB': {}
}
```
We used the above parameters to train our models on the training data, using:
- different numbers of neighbors for *KNN*
- different kernels and values of C for *SVC*
- different maximum depths for the *Decision Tree*, then deciding on the best-performing one
- different maximum depths for the *Random Forest* in combination with different numbers of estimators, then deciding on the best one
- different types of *NN* with early stopping, also deciding on the best one
- *LDA* with the lsqr and svd solvers
- the multi-class and solver parameters for the *Logistic Regression*
- the *Naive Bayes* without any specification
#### Fit Model Algorithms
After tweaking the hyperparameters of each model, the algorithms are trained on the training dataset, which comes from the original dataset split into train and test. For each parameter set in the tuning process, the algorithm is trained on every combination: e.g. given 1024 perceptrons and both 100 and 1000 iterations, the algorithm is trained once with 1024 perceptrons and 100 iterations and once with 1024 perceptrons and 1000 iterations.
After deciding on the parameters for the algorithms, a *GridSearchCV* was used to fit the models with the predefined parameters. The code can be found in the corresponding .py-file.
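A minimal sketch of how the fitting with *GridSearchCV* could look for one of the models, using the `params` dictionary above; the variable names here are illustrative assumptions, the actual code is in the .py-file:
```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# illustrative example for KNN; the same pattern applies to the other models
grid = GridSearchCV(KNeighborsClassifier(), params['KNN'],
                    cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(data_train, target_train)
print(grid.best_params_, grid.best_score_)
best_knn = grid.best_estimator_
```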
### Cross-Validation
This step is used to check the trained models and their performance. It prepares the models for evaluation by cross-validating the tuned hyperparameters: the algorithm is evaluated on different splits of the data and its performance is judged by the outcome of the CV step. Using the results of the predictions made in the previous step, one can plot and compare the predictions made by the trained algorithms.
#### Evaluation
By evaluating the outcome of the best estimator with the data the cross-validation provides, one can tell whether the hyperparameters used were a good or a bad choice, and also whether the model is overfitting. Different approaches can be used to evaluate the predictions of the trained models: using the accuracy in combination with boxplot comparisons or looking at confusion matrices, one can e.g. detect that a model overfits. This happens if the model performs vastly better on the training data than on the test data.
#### Predictions and Comparison of selected Properties
##### Accuracy Comparison
To get comparable measurements for the models, we compared the accuracies of all models, both for train and test data.
##### Cohen-Kappa-Comparison
With *Cohen's Kappa Coefficient* we can measure the inter-rater reliability, i.e. the degree of agreement among raters. It is a more robust measure than a simple percent-agreement calculation: it tells you how much better your classifier performs compared to a classifier that simply guesses at random according to the frequency of each class.
##### MAE, MSE Comparison
The mean absolute error (MAE) and the mean squared error (MSE) are used to rate and evaluate our models; they tell us how error-prone the models are.
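A sketch of how the metrics in the table below could be computed with scikit-learn, assuming `model` is one of the fitted estimators and the targets are the encoded gesture labels:
```python
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             mean_squared_error, cohen_kappa_score)

predictions = model.predict(data_test)
print('Accuracy:     ', accuracy_score(target_test, predictions))
print('MAE:          ', mean_absolute_error(target_test, predictions))
print('MSE:          ', mean_squared_error(target_test, predictions))
print("Cohen's Kappa:", cohen_kappa_score(target_test, predictions))
```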
| Tested Models | Accuracy | MAE | MSE | Cohen's Kappa |
|---------------|----------|---------|---------|---------------|
| KNN | 0.9741 | 0.08889 | 0.39259 | 0.97031 |
| SVC | 0.9722 | 0.06667 | 0.22963 | 0.9682 |
| CART | 0.7944 | 0.58889 | 2.38519 | 0.76485 |
| RF | 0.9630 | 0.11852 | 0.5 | 0.95761 |
| NN | 0.9630 | 0.10556 | 0.43519 | 0.95758 |
| LDA | 0.8259 | 0.58889 | 2.57037 | 0.80079 |
| LR | 0.8481 | 0.49815 | 2.19444 | 0.82625 |
| NB | 0.8926 | 0.36111 | 1.67593 | 0.87707 |
According to the table, CART was the worst model with an accuracy of about 79 %. Looking at Cohen's Kappa coefficient, CART performs even worse. In general, Cohen's Kappa rates the models lower than the accuracy does. Considering the MSE and MAE, LDA, LR and CART are the most error-prone models, which is reflected in their accuracies.
#### Cross-Validation Boxplot Comparison
Using the results of the cross-validation process, one can compare the outcomes and performance of the trained models. First on the training data, where *KNN*, *SVC* and *RF* are close to 100 %, whereas *CART* only reaches about 83 %.

More interesting are the validations on the test data, as the algorithms have never seen this data before.

Here the differences can be seen very well: for *KNN* the predictions were very good, while the variance of *CART*, *SVC* and *NN* increased. Interestingly, *LDA* and *LR* perform much better than on the training data. *NB* also increased in accuracy while its variance decreased.
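A minimal sketch of how such a cross-validation boxplot comparison could be produced, assuming `models` is a dict mapping the model names to the fitted estimators; the number of folds here is an assumption for illustration:
```python
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score

# collect the accuracy of each model over 10 cross-validation splits
results, names = [], []
for name, model in models.items():
    scores = cross_val_score(model, data_test, target_test,
                             cv=10, scoring='accuracy')
    results.append(scores)
    names.append(name)

plt.boxplot(results, labels=names)
plt.ylabel('accuracy')
plt.title('Cross-validation on the test data')
plt.show()
```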
#### Confusion Matrix Comparison
By plotting the confusion matrices for every model, we corroborate our findings based on the accuracies.
### Model Decision
*Random Forest* and *Neural Networks* show a very high accuracy compared to the other models. *KNN* and *SVC* were not considered because we think they are overfitting. While *RF* and *NN* both take time to train and produce similar results, *RF* was chosen over *NN* due to its faster evaluation of the test data: *RF* took 102.19 s on average and *NN* 131.025 s on average in the cross-validation process performed on the test data.