---
tags: coderschool, week 5, machine learning
---
# Week 5: Machine Learning 2
Quote of the Week:
> "If the task is too hard for human, it is probably too hard for the machine too"
**Table of content**
[TOC]
___
## Day 1: Organizing ML project
* Reference:
https://chrisalbon.com/
* Further review: https://colab.research.google.com/drive/15fLQq07kO9aLzD5gIhU8_j1XrGgsBId7#scrollTo=LFHHF4FHRWPl
### The Checklist
1. Frame the problem and look at the big picture
2. Get the data
3. Explore the data for insights
4. Prepare the data for patterns to ML algorithms
5. Explore models and shortlist the best ones
6. Fine-tune models and combinations for best solution
7. Present the solution
8. Launch, monitor and maintain
#### 1. Frame the problem
1. Objective in business terms
2. How to apply the solution? (Who? How?)
3. Look at current solutions (baseline model)
4. Measurement for the performance
5. Domain knowledge for the problem (available expert)
6. Methods to solve the problem (manually)
7. List down assumptions
8. Verify assumptions
#### 2. Get the data
1. List the data and the amount
2. Check the space it takes
3. Check legal obligations
4. Create workspace
5. Get the data
6. Convert data to a format for easy manipulation
7. Ensure deletion or protection of sensitive information
8. Sample a test set and leave it alone
#### 3. Explore the data
1. Create a copy or sample
2. Use Jupyter Notebook for EDA
3. Understand attributes and characteristics
4. Visualize data
5. Study correlations between attributes
6. Find manual solution for the problem
7. Get more useful data if necessary
8. Documentation
#### 4. Prepare the data
1. Use copies of data
2. Write functions for data transformation to be applied
3. Clean data
4. Select features
5. Engineer features
6. Scale, standardize or normalize features
#### 5. Explore different models
1. If the data is huge, sample smaller training sets
2. Try to automate as much as possible
3. Train many quick-and-dirty models with standard parameters
4. Measure and compare their performance with k-fold cross-validation
5. Analyze the most significant variables for each algorithms
6. Analyse the types of errors the models make
7. Perform a quick round of feature selection and engineering
8. Shortlist most promising models
#### 6. Fine-tune the system
1. Use as much as possible, automate what you can
2. Fine-tune the hyper-parameters
3. Treat your data transformation choices as hyper-parameters
4. Prefer Random Search over Grid Search
5. Try Ensemble methods
6. Test the system on the test set if confident then estimate the generalization error (Ex: data mismatch)
#### 7. Presentation
1. Documentation
2. Prepare presentation
3. Highlight big picture
4. Explain solutions for business objective
5. Show interesting points during processes
6. Efficient visualizations and clear statements
#### 8. Launch
1. Prepare solution for production
2. Write monitoring code to check live performance
3. Write codes for alerting stoppage
4. Retrain models with new data
___
## Day 2: Decision Tree - Random Forest
### Decision Tree Classifier

* What is Decision Tree?
Decision tree is the most powerful and popular tool for classification and prediction. A Decision tree is a flowchart like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
* What are the nodes in decision tree?
* Root node: entire population or sample, divided into two or more homogeneous sets.
* Decision Node: a sub-node splitted into further sub-nodes is called decision node.
* Parent/child node: A node divided into sub-nodes is called parent node of sub-nodes whereas sub-nodes are the child of parent node.
* What is gini index?
Gini index or Gini impurity measures the degree or probability of a particular variable being wrongly classified when it is randomly chosen. If all the elements belong to a single class, then it can be called pure. The degree of Gini index varies between 0 and 1, where 0 denotes that all elements belong to a certain class or if there exists only one class, and 1 denotes that the elements are randomly distributed across various classes. A Gini Index of 0.5 denotes equally distributed elements into some classes.
* Gini index formula

* Favors larger partitions.
* Uses squared proportion of classes.
* Perfectly classified, Gini Index would be zero.
* Evenly distributed would be 1 – (1/# Classes).
* You want a variable split that has a low Gini Index.
* Decision Trees can overfit/underfit.

* What are overfitting and underfitting?
* **Overfitting** happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data is picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the models ability to generalize.
* **Underfitting** refers to a model that can neither model the training data nor generalize to new data. An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data.
* Techniques to limit overfitting:
* Use a resampling technique to estimate model accuracy.
* Hold back a validation dataset.
* Wisdom of the crowd: https://en.wikipedia.org/wiki/Wisdom_of_the_crowd
```
n_models = 10000
n_votes = 10000
def predict():
return (np.random.rand() <= 0.51)[0]
for _ in range(n_votes):
sum_answer = 0
for i in range(n_models):
sum_answer += predict()
sum_votes += sum_answer > (n_models//2)
sum_votes/n_votes
sum([(np.random.rand(n)<= 0.51).sum() > (n//2) for i in range(votes)])/votes
```
### Random Forest

* What is random forest?
The random forest is a model made up of many decision trees. Rather than just simply averaging the prediction of trees (which we could call a “forest”), this model uses two key concepts that gives it the name random:
* Random sampling of training data points when building trees
* Random subsets of features considered when splitting nodes
* How does it work?
The random forest combines hundreds or thousands of decision trees, trains each one on a slightly different set of the observations, splitting nodes in each tree considering a limited number of the features. The final predictions of the random forest are made by averaging the predictions of each individual tree.
* How to tune hyperparameters?
* Grid Search
It simply builds a model for each possible combination of all of the hyperparameter values provided, evaluating each model, and selecting the architecture which produces the best results.
Example: Performing grid search over the defined hyperparameter space
```
n_estimators = [10, 50, 100, 200]
max_depth = [3, 10, 20, 40]
```
The results
```
RandomForestClassifier(n_estimators=10, max_depth=3)
RandomForestClassifier(n_estimators=10, max_depth=10)
RandomForestClassifier(n_estimators=10, max_depth=20)
RandomForestClassifier(n_estimators=10, max_depth=40)
RandomForestClassifier(n_estimators=50, max_depth=3)
RandomForestClassifier(n_estimators=50, max_depth=10)
RandomForestClassifier(n_estimators=50, max_depth=20)
RandomForestClassifier(n_estimators=50, max_depth=40)
RandomForestClassifier(n_estimators=100, max_depth=3)
RandomForestClassifier(n_estimators=100, max_depth=10)
RandomForestClassifier(n_estimators=100, max_depth=20)
RandomForestClassifier(n_estimators=100, max_depth=40)
RandomForestClassifier(n_estimators=200, max_depth=3)
RandomForestClassifier(n_estimators=200, max_depth=10)
RandomForestClassifier(n_estimators=200, max_depth=20)
RandomForestClassifier(n_estimators=200, max_depth=40)
```
Each model would be fit to the training data and evaluated on the validation data.
* Random Search
Random search differs from grid search in that it doesn't need a discrete set of values to explore for each hyperparameter. It will sample a statistical distribution for each hyperparameter from random values.
Example: Defining a sampling distribution for each hyperparameter
```
from scipy.stats import expon as sp_expon
from scipy.stats import randint as sp_randint
n_estimators = sp_expon(scale=100)
max_depth = sp_randint(1, 40)
```
One of the main theoretical backings to motivate the use of random search in place of grid search is the fact that for most cases, hyperparameters are not equally important.
>A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. - Bergstra, 2012

### Sklearn classification_report
* What is precision? Ability of a classification model to identify all relevant instances

* What is recall? Ability of a classification model to return only relevant instances

* Increasing recall = Decreasing precision
* What is F1-Score? Single metric that combines recall and precision using the harmonic mean

___
## Day 4: KNN - SVM
### K-Nearest Neighbor (KNN)
* What is KNN?
KNN is a non-parametric, lazy learning algorithm. Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point.

* What is the curse of high dimensionality?
KNN performs better with a lower number of features than a large number of features. It is when the number of features increases than it requires more data. Increase in dimension also leads to the problem of overfitting. To avoid overfitting, the needed data will need to grow exponentially with the increase of dimensions. This problem of higher dimension is known as the Curse of Dimensionality. To deal with the problem of the curse of dimensionality, we need to perform principal component analysis before applying any machine learning algorithm, or we can also use feature selection approach.

### Support Vector Machine (SVM)
* What is SVM?
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.
* What is hyperplane?
Hyperplane is a 2D line and 3D plane to an arbitrary number of dimensions.
Data can be separated by a hyperplane is known as **linearly separable** data.
The hyperplane acts as a linear classifier.

* What are the margins?

The goal is to find the plane with maximum margins where the middle line is the best line.
SVM has a parameter called C to 'harden' or 'soften' the margin, depending on larger or smaller C.

Optimal C can be obtained by tuning with K-Fold Cross Validation
Non-linearly separable data can be solved with higher dimension


* Unit vector: http://mathworld.wolfram.com/UnitVector.html
___
#### Note from Pierre's speech
**Check Machine Heatlh**
* Turn vibration frequency into numbers to classify the health of the machine
* Domain knowledge is to help you find the right data for your model.
___
## Day 5: Unsupervised Learning
* A **supervised learning** algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
* The problem of **unsupervised learning** is that of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution.
### K-means Clustering
* What is K-means Clustering?
Kmeans algorithm is an iterative algorithm that tries to partition the dataset into Kpre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the inter-cluster data points as similar as possible while also keeping the clusters as different (far) as possible.
* How does it work?
* Specify number of clusters K.
* Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
* Keep iterating until there is no change to the centroids. i.e assignment of data points to clusters isn’t changing.
* Compute the sum of the squared distance between data points and all centroids.
* Assign each data point to the closest cluster (centroid).
* Compute the centroids for the clusters by taking the average of the all data points that belong to each cluster.
* How to find optimal k in k-means? Elbow method https://en.wikipedia.org/wiki/Elbow_method_(clustering)
The basic idea behind this method is that it plots the various values of cost with changing k. As the value of K increases, there will be fewer elements in the cluster. So average distortion will decrease. The lesser number of elements means closer to the centroid. So, the point where this distortion declines the most is the elbow point.
For reference: https://www.geeksforgeeks.org/ml-determine-the-optimal-value-of-k-in-k-means-clustering/

Elbow is forming at K=3
* To visualize K-Means: https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
### Hierarchical Clustering
* What is hierarchical clustering?
Hierarchical clustering is a type of unsupervised machine learning algorithm used to cluster unlabeled data points. Like K-means clustering, hierarchical clustering also groups together the data points with similar characteristics.
* What are types of hierarchical clustering?
* Agglomerative: initially each data point is considered as an individual cluster. At each iteration, the similar clusters merge with other clusters until one cluster or K clusters are formed.
* Divisive is not much used in the real world, exactly the opposite of the Agglomerative Hierarchical clustering. All the data points are considered as a single cluster and in each iteration, we separate the data points from the cluster which are not similar. Each data point which is separated is considered as an individual cluster. In the end, we’ll be left with n clusters.
* How does it work?
* At the start, treat each data point as one cluster. Therefore, the number of clusters at the start will be K, while K is an integer representing the number of data points.
* Form a cluster by joining the two closest data points resulting in K-1 clusters.
* Form more clusters by joining the two closest clusters resulting in K-2 clusters.
* Repeat the above three steps until one big cluster is formed.
* Once single cluster is formed, dendrograms are used to divide into multiple clusters depending upon the problem.

### PCA(Principle Component Analysis)
* What is PCA?
PCA finds a new set of dimensions (or a set of basis of views) such that all the dimensions are orthogonal (and hence linearly independent) and ranked according to the variance of data along them. It means more important principle axis occurs first. (more important = more variance/more spread out data)
* How to choose number of components?
```
pca = PCA().fit(digits.data)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
```

___
## Day 6:
* Note:
* Noises will be added to images of clothing (zooming, rotation, blurring)
Follow PEP 8 for beautiful Python code: https://www.python.org/dev/peps/pep-0008/