# Week 6

## Day 1 - KNN & Regularization

**Weekly Project Presentation**

**Bias-Variance tradeoff**
- High Bias (training error is high)
  - Use a more complex model (e.g. kernel methods)
  - Add features
- High Variance (training error is much lower than test error)
  - Add more training data
  - Reduce the model complexity (e.g. regularization)

**K-Nearest Neighbors**
Classify a point based on a majority voting rule. KNN simply looks at the K points in the training set that are nearest to the input x and assigns the most common label among them to the output. This method is also called instance-based learning.

**K-fold cross validation**
![](https://i.imgur.com/UezFnt4.png)
In k-fold cross validation, we randomly split the training dataset into k folds, where **$k-1$ folds are used for model training** and **one fold is used for performance evaluation**. We repeat this k times to **obtain k models** and k performance estimates.

**Curse of Dimensionality**
In a high-dimensional space, data points tend to be far from each other --> harder to detect neighbors --> the "curse".

## Day 2 - Linear SVM

### Support Vector Machine
The Support Vector Machine (SVM) is a linear classifier that finds the maximum-margin separating hyperplane. In other words, SVM finds the separating hyperplane that maximizes the distance to the closest data points from both classes. The distance to the closest data points should be equal for both classes.

### Max Margin Classifier
The margin γ is the distance from the hyperplane to the closest point across both classes.

### The support vectors
The support vectors are the closest points to the hyperplane in each class. These points lie on the margin lines (hard margin), and the margin lines define the hyperplane. For the optimal $\mathbf{w},b$, some training points will have tight constraints:
$$
y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b) = 1.
$$

### Soft Margin
Problem with the hard margin: it is too sensitive to new data points. If a data point is added inside the margin, the hard-margin hyperplane shifts significantly --> easy to overfit. Also, if there is no separating hyperplane between the two classes, the optimization problem above has no solution. To fix this we introduce the hinge loss function and accept some loss in our model:
$$
\min_{\mathbf{w},b}\underbrace{\mathbf{w}^T\mathbf{w}}_{\ell_2\text{-regularizer}} + C\ \sum_{i=1}^{m}\underbrace{\max\left[ 1-y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b),\,0 \right]}_{\text{hinge loss}}
$$
C is a hyperparameter (exposed in the SVC library). A lower C allows more loss, i.e. more points are allowed to lie within the two margins ('on the street') or on the wrong side of the hyperplane, so C is the knob for fixing overfitting in a linear SVM:
- Higher C = stronger penalty. The more points we allow on the street, the bigger the hinge loss becomes. When C is big enough, no point is allowed within the margins at all. This is the hard margin, and it can lead to overfitting.
- Lower C = weaker penalty. More points are allowed within the margins.

### Kernel SVM
For non-linearly separable data (e.g. concentric circles), we use a kernel SVM to draw a non-linear decision boundary.
![](https://i.imgur.com/XxUvW6B.png)

**Kernel method**: project the data into a high-dimensional space where it is linearly separable and find a hyperplane in that space. --> If we can't solve the problem in the original 2D space, the kernel method projects the data into a higher-dimensional space, solves the problem there, and then maps the hyperplane back into the 2D space, where it appears as a non-linear boundary.
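A minimal sketch (not from the course materials) tying together the ideas above: KNN, k-fold cross-validation, and the soft-margin / kernel SVM. It assumes scikit-learn; the toy "circles" dataset and all parameter values are illustrative choices, not the course's own setup.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Non-linearly separable data: one class forms a circle inside the other.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.3, random_state=42)

models = {
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Linear SVM (C=1)": SVC(kernel="linear", C=1.0),
    "RBF-kernel SVM (C=1)": SVC(kernel="rbf", C=1.0),
    "RBF-kernel SVM (C=100)": SVC(kernel="rbf", C=100.0),  # stronger penalty -> harder margin
}

# 5-fold cross-validation: train on 4 folds, evaluate on the held-out fold, repeat 5 times.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:25s} mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On this kind of data the linear SVM should score poorly while the RBF-kernel SVM and KNN separate the two circles easily, which is the point of the kernel method.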
## Decision Tree Classifier
Learning a decision tree: find the sequence of if/else questions that gets us to the true answer most quickly. For example, suppose we try to classify animals into species based on Color and Height:
![](https://i.imgur.com/HF0hP5R.png)
**Bad news**: finding a minimum-size tree is pretty hard.
**Good news**: we can approximate it with a greedy strategy. We keep splitting the data to minimize an **impurity function**.
**ID3 algorithm**: try all features and all possible splits, and pick the split that minimizes impurity.

### Gini impurity for classification
The Gini score is a measure of impurity. It is maximal when the target classes are equally distributed and minimal when all observations belong to one group. For example, in the animal classification problem:
- Gini (root), first node: 1 monkey, 2 elephants, 2 leopards, 3 giraffes --> Gini(root) = 1 - [(1/8)^2 + (2/8)^2 + (2/8)^2 + (3/8)^2] = 0.71875

An overfit tree --> too many groups; the data tends to end up with one observation per group.
![](https://i.imgur.com/oFGa2EL.png)

### Advantages of Decision Trees
- Easy to use and interpret
- Little data preparation required
- Able to handle both numerical and categorical data
- Not a black box: it is easy to understand what happens behind the scenes
- The data does not need to be linearly separable

## Bagging and Boosting
**Bagging**: have multiple decision tree models evaluate the same dataset with different subsets of features --> Random Forest. It is used to deal with overfitting.

**Boosting**: a black-box machine learning technique. It focuses only on the training samples that are hard to classify.
![](https://quantdare.com/wp-content/uploads/2016/04/bb1-800x221.png)
![](https://quantdare.com/wp-content/uploads/2016/04/bb2-800x307.png)
![](https://quantdare.com/wp-content/uploads/2016/04/bb3-800x307.png)
Bagging: the models run in parallel and the final result is picked by majority vote. Boosting: every tree is evaluated sequentially.
![](https://quantdare.com/wp-content/uploads/2016/04/bb4-800x307.png)
Bagging: each tree has its own prediction. Boosting: each tree also has its own prediction, but the predictions are weighted.
![](https://quantdare.com/wp-content/uploads/2016/04/bb5-800x285.png)
Bagging: we train each learner and then keep it. Boosting: we train and evaluate each learner; if we decide to keep it, we set a weight for the learner, otherwise we drop it.

## Day 3
Holiday

## Day 4 - California House Pricing Project
**Workflow:**
- Download the dataset from GitHub
- Select a metric for the model: Root Mean Square Error (RMSE)
- Split into train/test sets using split_train_test
- Use qcut to split the dataset by income category
- Explore the data using a California map plot and correlations
- Prepare the data for ML algorithms:
  - Data cleaning
  - Handling categorical attributes
  - Transforming data
  - Feature scaling to make the model train faster
  - Constructing a pipeline to combine all data transformations
- Column transformers: specify which columns should receive which preprocessing
- Encoding techniques for categorical columns:
  - OrdinalEncoder - numbers the categorical values
  - OneHotEncoder - creates dummy variables
- Select and train a model:
  - Base model: Linear Regression, a starting model to check whether your problem really needs a more complex model
  - Decision Tree: very easy to overfit if we don't limit how deep the tree can grow
  - Random Forest
- Cross-validation score: cross_val_score, a function in sklearn that splits your training dataset into smaller folds and evaluates the model on each fold (see the sketch below).
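A minimal sketch of this workflow, assuming scikit-learn and pandas. The tiny DataFrame and the column names (median_income, total_rooms, ocean_proximity, median_house_value) only stand in for the housing data; the pipeline structure is the part that matters.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative toy frame standing in for the California housing data.
df = pd.DataFrame({
    "median_income": [8.3, 7.2, 3.8, 2.5, 4.1, 5.6, 1.9, 6.4],
    "total_rooms":   [880, 7099, 1467, 1274, 1627, 919, 2535, 3104],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND",
                        "NEAR OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY"],
    "median_house_value": [452600, 358500, 352100, 341300, 269700, 299200, 241400, 226700],
})
X = df.drop(columns="median_house_value")
y = df["median_house_value"]

preprocess = ColumnTransformer([
    # numeric columns: fill missing values, then scale
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["median_income", "total_rooms"]),
    # categorical column: one-hot encode into dummy variables
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

model = Pipeline([("prep", preprocess),
                  ("forest", RandomForestRegressor(n_estimators=50, random_state=42))])

# cross_val_score expects "greater is better", so the metric comes back as a
# *negative* MSE; flip the sign and take the square root to get RMSE per fold.
neg_mse = cross_val_score(model, X, y, cv=4, scoring="neg_mean_squared_error")
print("RMSE per fold:", np.sqrt(-neg_mse))
```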
One special thing about cross_val_score is that it expects a "greater is better" score rather than "lower is better", so it returns the negative of the error metric (e.g. negative MSE); we flip the sign back (and take the square root) to get the RMSE.

**Fine-tune your model**
- GridSearchCV: a function that tries every combination of the given parameter values and reports the optimal combination
- RandomizedSearchCV: specify a number of trials and let the computer run iterations to find out which value ranges are optimal for our hyperparameters. The hyperparameter values can be sampled randomly.

## Day 5 - Unsupervised Learning
Unsupervised problems: there are no labels.

**K-means algorithm:**
- Pick the group centroids randomly
- Calculate the distance between each point and each centroid
- Assign each point to a group based on its nearest centroid
- Update the centroids
- Recalculate the distances between centroids and points
- Reassign groups based on distance
- Update the centroids again
- Repeat the whole process

**Two main problems in K-means:**
- The number of clusters K?
  - We can use the elbow method: compute the MSE for all candidate K values (see the sketch at the end of this section)
  - Mean square error: different from the MSE in other algorithms; the MSE in K-means is calculated from the distances between the centroids and their points
  - A large number of clusters K can lead to over-engineering. We should choose a K where the score starts to decrease slowly (in the chart, the optimal K would be 4)
  - Another method: hierarchical clustering --> a bottom-up algorithm that calculates the distance between groups
  ![](https://i.imgur.com/mGETGSk.png)
- The initialization points

**k-means for color compression**
Use k-means to group the RGB colors into 16 clusters --> far fewer color values, while still keeping the color scheme of the image. However, other, more complex machine learning models may still need to work with the original images.

**Principal Component Analysis**
PCA reduces the dimensionality of the dataset, allowing most of the variability to be explained by a smaller number of dimensions. We expect to lose some information when using PCA. PCA can also help denoise the dataset (see the sketch below).
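Two minimal sketches for Day 5, assuming scikit-learn; the toy datasets, the noise level, and the variance threshold are illustrative choices, not values from the course.

The first shows the elbow method: KMeans exposes the within-cluster sum of squared distances as inertia_, and we look for the K where that curve flattens.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 4 true clusters.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"K={k}: inertia={km.inertia_:.1f}")
# The inertia drops sharply up to K=4 and only slowly after that -> the "elbow".
```

The second shows PCA used for dimensionality reduction and denoising: fit PCA on noisy digit images, keep only the components explaining about half of the variance, and project back to the original space.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
rng = np.random.default_rng(0)
noisy = digits.data + rng.normal(scale=4.0, size=digits.data.shape)  # add Gaussian noise

pca = PCA(n_components=0.50)                   # keep components explaining ~50% of the variance
components = pca.fit_transform(noisy)          # reduced representation (fewer dimensions)
denoised = pca.inverse_transform(components)   # project back; most of the noise is dropped

print("original dimensions:", noisy.shape[1])
print("kept components:", pca.n_components_)
print("explained variance ratio (sum):", pca.explained_variance_ratio_.sum().round(3))
```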