---
title: 'Tâm - Week 5'
tags: CoderSchool, Mariana
---

Week 5
===

## Table of Contents
[TOC]

## Monday

### How to organize your machine learning project
- Frame the problem and look at the big picture
- Get the data
- Explore the data to gain insights
- Prepare the data to better expose the underlying patterns to ML algorithms
- Explore many different models and shortlist the best ones
- Fine-tune your models and combine them into a great solution
- Present your solution
- Launch, monitor, and maintain your system

## Tuesday

### Decision tree
- A decision tree can underfit or overfit; left unconstrained, it can memorize the dataset.
- Underfit: low accuracy on both train and test. Overfit: high accuracy on train, low on test. Good fit: high accuracy on both.
- **Random forest** (sketch in the appendix below)
    - Wisdom of the crowd: the models are independent of each other.
    - The data is split into subsets, one for each tree.
    - `RandomForestClassifier`: `n_estimators` is the number of trees; `max_depth` limits how deep each tree can grow.
- **Classification report** (sketch in the appendix)
    - Precision: $\frac{TP}{TP+FP}$ — of all the positive predictions we make, how many are correct.
    - Recall: $\frac{TP}{TP+FN}$ — of all the actually positive cases (e.g., sick patients), how many we catch.
    - F1 score: $\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ — the harmonic mean of precision and recall.
- **Tuning hyperparameters** (sketch in the appendix): grid search works well with a small number of choices; random search is better with a larger number of choices.
- K-fold cross-validation: split the training set into k folds; train on k−1 of them and validate on the remaining fold, rotating through all k folds.

## Wednesday

### K-nearest neighbors
- In a multidimensional space, distance acts as similarity: a point is labeled by the majority vote of its k nearest neighbors (sketch in the appendix).

### Credit card fraud detection
https://colab.research.google.com/drive/1mLCRTM-Hnx02_U27oX67TA4FhRlqt41l
- Data resampling techniques for imbalanced classes (sketch in the appendix):
    - Oversampling: duplicate or synthesize minority-class samples.
    - Undersampling: drop majority-class samples.

## Thursday
https://colab.research.google.com/drive/1bBwXsE1oVa4Ik6uQ63hjBN-58py3VwFi

### Unsupervised learning
- K-means clustering: cluster data points into K groups (sketch in the appendix).
- Each group has a centroid. The algorithm starts with random centroids, assigns each data point to its nearest centroid, then moves each centroid to the mean of the points assigned to it. Repeat until the centroids barely move.
- Elbow method: plot the clustering score (e.g., inertia) against k and choose the k at the elbow of the curve.
- K-means pros and cons
- **Risk with k-means:** depending on where the initial centroids land, the algorithm can converge to a local minimum; rerunning with different initializations reduces the risk.

### Hierarchical clustering
- We can perform it bottom-up (agglomerative) or top-down (divisive) (sketch in the appendix).

### PCA
- Plotting the number of components against the variance explained shows how much of the variance in the data each component count accounts for.
- Choose the smallest number of components that explains most of the variance (sketch in the appendix).

## Friday

### Interview prep

### Fashion MNIST dataset
- Preprocessing the data is very important.
- Normalizing the data and reducing it with PCA let models train faster and more accurately (sketch in the appendix).

:::info
**Find this document incomplete?** Leave a comment!
:::

###### tags: `CoderSchool` `Mariana` `MachineLearning`
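## Appendix: code sketches

Small Python sketches for this week's topics. The datasets, parameter values, and variable names are illustrative choices, not taken from the lectures.

### Decision tree vs. random forest

A minimal sketch of the Tuesday discussion: an unconstrained decision tree can memorize the training set, while a random forest of many shallow trees usually generalizes better. The synthetic dataset and parameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, assumed for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A fully grown tree (max_depth=None) can memorize the training set: overfit.
tree = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_train, y_train)

# A random forest averages many trees trained on bootstrap subsets of the
# data ("wisdom of the crowd"), which usually generalizes better.
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
forest.fit(X_train, y_train)

# Compare train vs. test accuracy to diagnose overfit / good fit.
for name, model in [("tree", tree), ("forest", forest)]:
    print(name,
          "train:", model.score(X_train, y_train),
          "test:", model.score(X_test, y_test))
```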
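### Classification report

A sketch checking the precision/recall/F1 formulas above against scikit-learn's metric functions, on a tiny hand-made prediction vector (the labels are arbitrary examples).

```python
from sklearn.metrics import (classification_report, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# precision = TP / (TP + FP), recall = TP / (TP + FN)
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# F1 is the harmonic mean: 2 * p * r / (p + r) — should match f1_score.
print(f1_score(y_true, y_pred), 2 * p * r / (p + r))

# classification_report prints all three per class.
print(classification_report(y_true, y_pred))
```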
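### Grid search vs. random search

A sketch contrasting exhaustive grid search (fine for a small number of choices) with randomized search (better for larger spaces), both scored with 5-fold cross-validation. The grid values and distributions are arbitrary.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Grid search: tries every combination in the (small) grid.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [50, 100], "max_depth": [3, 5]},
                    cv=5).fit(X, y)

# Randomized search: samples n_iter combinations from wide ranges.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                          {"n_estimators": randint(10, 200),
                           "max_depth": randint(2, 20)},
                          n_iter=10, cv=5, random_state=0).fit(X, y)

print("grid best:", grid.best_params_)
print("random best:", rand.best_params_)
```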
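### K-nearest neighbors

A sketch of KNN on the iris dataset: each point is classified by the majority label among its k closest points, so Euclidean distance acts as the similarity measure.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters for KNN: distances would otherwise be dominated by
# whichever feature has the largest numeric range.
scaler = StandardScaler().fit(X_train)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)
print("test accuracy:", knn.score(scaler.transform(X_test), y_test))
```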
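### Oversampling and undersampling

A sketch of resampling a skewed fraud-style dataset, assuming the third-party imbalanced-learn (`imblearn`) package, which the notes don't name; the 95/5 class split is an assumption.

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Assumed imbalance: ~95% legitimate vs. ~5% "fraud".
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
print("original:", Counter(y))

# Oversampling: duplicate minority-class rows until classes are balanced.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled:", Counter(y_over))

# Undersampling: drop majority-class rows until classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_under))
```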
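### K-means and the elbow method

A sketch of k-means over a range of k, printing the inertia (within-cluster sum of squared distances) that the elbow plot would be drawn from; the blob data is assumed.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# n_init=10 restarts k-means with different random centroids, reducing the
# risk of converging to a poor local minimum.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))

# Plot inertia vs. k and pick the k at the "elbow", where the curve flattens.
```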
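### Hierarchical clustering

A sketch of the bottom-up (agglomerative) variant with scikit-learn, plus the SciPy linkage matrix a dendrogram would be drawn from.

```python
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Agglomerative = bottom-up: start with one cluster per point, then
# repeatedly merge the two closest clusters until n_clusters remain.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(labels[:10])

# The linkage matrix records every merge; scipy.cluster.hierarchy.dendrogram
# can visualize it as a tree.
Z = linkage(X, method="ward")
print(Z[:3])
```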
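### PCA: choosing the number of components

A sketch of the components-vs-variance-explained idea on the digits dataset: compute the cumulative explained-variance ratio and pick the smallest component count that keeps most of the variance (95% here, an arbitrary threshold).

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
print("components for 95% variance:", np.argmax(cumulative >= 0.95) + 1)

# Equivalently, PCA accepts a variance target directly.
print(PCA(n_components=0.95).fit(X).n_components_)
```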
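### Fashion MNIST: normalize + PCA

A sketch of the Friday point, assuming TensorFlow/Keras is available for the dataset download: normalize the pixels, compress with PCA, then fit a simple classifier, which trains much faster on ~80 PCA features than on 784 raw pixels. The classifier choice and variance threshold are assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from tensorflow.keras.datasets import fashion_mnist

(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

# Flatten 28x28 images to 784-dim vectors; normalize pixels to [0, 1].
X_train = X_train.reshape(len(X_train), -1) / 255.0
X_test = X_test.reshape(len(X_test), -1) / 255.0

# Keep the components explaining ~90% of the variance.
pca = PCA(n_components=0.90).fit(X_train)
X_train_pca, X_test_pca = pca.transform(X_train), pca.transform(X_test)

clf = LogisticRegression(max_iter=1000).fit(X_train_pca, y_train)
print("test accuracy:", clf.score(X_test_pca, y_test))
```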