---
title: 'Tâm - Week 5'
tags: CoderSchool, Mariana
---
Week 5
===
## Table of Contents
[TOC]
## Monday
### How to organize your machine learning project
- Frame the problem and look at the big picture
- Get the data
- Explore the data to gain insights
- Prepare the data to better expose the underlying data patterns to ML algorithms
- Explore many different models and shortlist the best ones
- Fine-tune your models and combine them into a great solution
- Present your solution
- Launch, monitor and maintain your system
## Tuesday
### Decision tree
- A decision tree can underfit or overfit
- An unconstrained decision tree can memorize the training set
- Underfit: low accuracy on both train and test; overfit: high accuracy on train but low on test; good fit: high accuracy on both train and test (see the sketch below)
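A minimal sketch of the three regimes, using scikit-learn's `DecisionTreeClassifier`; the synthetic dataset, split, and depth values are illustrative assumptions:

```python
# Sketch: how max_depth moves a decision tree between underfitting
# and overfitting (synthetic data, purely illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for depth in [1, 5, None]:  # None lets the tree grow until it memorizes
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```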
- Random forest
  - Wisdom of the crowd: the individual trees are largely independent of each other
  - The training data is split into random subsets (bootstrap samples), one per tree
  - In `RandomForestClassifier`, `n_estimators` is the number of trees and `max_depth` limits the depth of each tree (see the sketch below)
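A minimal sketch, reusing `X_train`/`y_train`/`X_test`/`y_test` from the decision-tree example above; the hyperparameter values are arbitrary:

```python
# Sketch: a random forest averages many decorrelated trees, each
# trained on a bootstrap sample of the training data.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100,  # number of trees
                                max_depth=10,      # depth limit per tree
                                random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```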
- **Classification report:**
  - Precision: $TP/(TP+FP)$: of all the positive predictions we make, how many are actually correct
  - Recall: $TP/(TP+FN)$: of all the actual positives (e.g., sick patients), how many we manage to catch
  - F1 score: $F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
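A minimal sketch that prints all three metrics per class, reusing `forest`, `X_test`, and `y_test` from the previous sketches:

```python
# Sketch: precision, recall, and F1 per class in one call.
from sklearn.metrics import classification_report

y_pred = forest.predict(X_test)
print(classification_report(y_test, y_pred))
```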
- **Tuning hyperparameters:** grid search works well with a small number of candidate values; random search is better when there are many choices
- **K-fold cross-validation:** split the training set into k folds; train on k-1 of them and validate on the held-out fold, rotating through all k (see the sketch below)
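A minimal sketch of both search strategies plus 5-fold cross-validation, reusing `X_train`/`y_train` from the earlier sketches; the parameter grid is an arbitrary example:

```python
# Sketch: k-fold CV, exhaustive grid search, and random search.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_score)

forest = RandomForestClassifier(random_state=42)

# 5-fold CV: train on 4 folds, validate on the 5th, rotating 5 times.
print(cross_val_score(forest, X_train, y_train, cv=5))

params = {"n_estimators": [50, 100, 200], "max_depth": [5, 10, None]}

# Grid search tries all 9 combinations -- fine for a few choices.
grid = GridSearchCV(forest, params, cv=5).fit(X_train, y_train)
print(grid.best_params_)

# Random search samples n_iter combinations -- better for big spaces.
rand = RandomizedSearchCV(forest, params, n_iter=5, cv=5,
                          random_state=42).fit(X_train, y_train)
print(rand.best_params_)
```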
## Wednesday
### K Nearest Neighbors
- In a multidimensional feature space, distance is a measure of similarity: nearby points are treated as similar (see the sketch below)
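A minimal sketch with `KNeighborsClassifier`; the synthetic data and `n_neighbors=5` are arbitrary choices:

```python
# Sketch: KNN labels a point by majority vote among its k nearest
# training points, so "near in feature space" means "similar".
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```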
### Credit card fraud detection
https://colab.research.google.com/drive/1mLCRTM-Hnx02_U27oX67TA4FhRlqt41l
- Data resampling techniques help with heavily imbalanced classes (fraud cases are rare), as sketched below:
  - Oversampling: duplicate or synthesize minority-class examples
  - Undersampling: drop majority-class examples
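A minimal sketch of naive random oversampling with `sklearn.utils.resample`; the class ratio is made up, and libraries such as imbalanced-learn offer smarter methods like SMOTE:

```python
# Sketch: upsample the rare "fraud" class so both classes are balanced.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.array([0] * 950 + [1] * 50)  # 5% minority ("fraud") class

X_up, y_up = resample(X[y == 1], y[y == 1],
                      n_samples=950, random_state=0)  # with replacement

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))  # [950 950]

# Undersampling is the mirror image: draw only 50 majority-class
# rows instead, e.g. resample(..., replace=False, n_samples=50).
```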
## Thursday
https://colab.research.google.com/drive/1bBwXsE1oVa4Ik6uQ63hjBN-58py3VwFi
### Unsupervised learning
- K-means clustering: cluster data points into K groups
- Each group has a centroid. The algorithm starts with random centroids, assigns each data point to its nearest centroid, then moves each centroid to the mean of the points assigned to it. These steps repeat until the centroids stop moving much.
- Elbow method: plot the clustering score (inertia) against k and choose the k at the "elbow", where adding more clusters stops helping much
- K-means pros and cons: simple, fast, and scalable, but you must choose k up front and the result depends on the initial centroids
- **Risk with k-means:** we can converge to a local minimum depending on where the initial centroids land; rerunning with several random initializations mitigates this (see the sketch below)
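A minimal sketch of k-means plus the elbow method using inertia; the synthetic blobs and the range of k are arbitrary assumptions:

```python
# Sketch: fit k-means for several k and watch inertia (the within-
# cluster sum of squared distances) -- pick k at the "elbow".
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # the drop flattens past the true k
```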
### Hierarchical clustering
- We can perform it bottom-up (agglomerative: start with every point as its own cluster and keep merging the closest pair) or top-down (divisive: start with one cluster and keep splitting)
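A minimal sketch of the bottom-up (agglomerative) variant, which is the one scikit-learn implements; the data and `n_clusters=3` are illustrative:

```python
# Sketch: agglomerative clustering starts with every point as its
# own cluster and repeatedly merges the two closest clusters.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(labels[:20])
```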
### PCA
- When we plot the number of components against the cumulative variance explained, we can see how much of the data's variance each additional component accounts for
- We can choose the smallest number of components that still explains most of the variance (see the sketch below)
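A minimal sketch on scikit-learn's digits dataset; the 95% threshold is a common but arbitrary convention:

```python
# Sketch: cumulative explained variance, then letting PCA pick the
# smallest number of components that keeps 95% of the variance.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA().fit(X)
print(np.cumsum(pca.explained_variance_ratio_)[:10])

pca95 = PCA(n_components=0.95).fit(X)
print(pca95.n_components_)  # components needed for 95% of the variance
```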
## Friday
### Interview prep
### Fashion MNIST dataset
- Preprocessing the data is very important
- Normalizing the data and reducing it with PCA lets models train faster and often score better (see the sketch below)
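A minimal sketch of a scale → PCA → classify pipeline; Fashion-MNIST is fetched from OpenML under its OpenML name, and the subsample size and classifier are arbitrary choices to keep it fast:

```python
# Sketch: standardize, compress with PCA, then classify.
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = fetch_openml("Fashion-MNIST", version=1,
                    return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X[:10000], y[:10000], random_state=0)  # subsample for speed

clf = make_pipeline(StandardScaler(),
                    PCA(n_components=0.95),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```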
:::info
**Find this document incomplete?** Leave a comment!
:::
###### tags: `CoderSchool` `Mariana` `MachineLearning`