# WEEK 6 (21-25/12/20)

## KNN & Naive Bayes (Mon, 21/12/20)

### **Bias-variance tradeoff**

**==Error due to Bias==**: The error due to ==bias== is the difference between the ==expected (or average) prediction== of our model and the correct value that we are trying to predict.

**==Error due to Variance==**: The error due to ==variance== is the variability of a model's prediction for a given data point. It captures ==how much the predictions for a GIVEN POINT vary when using DIFFERENT MODELS== to make these predictions.

![](https://i.imgur.com/4XJz5j9.png)

---> ==High-variance== learning methods may be able to represent their training set well, but are at risk of ==overfitting==

==**Detecting High Bias and High Variance**==

* High Bias: training error is high
    * Use a more complex model (e.g. higher-degree polynomial, or kernel methods)
    * Add features that have strong predictive power
* High Variance: training error is low, but much lower than the test error (test error is high)
    * Add more training data
    * Reduce the model complexity (e.g. regularization)

### **Regularizations for linear models**

---> ==**reduce overfitting**==

By adding a regularization term to the loss function, the model:
* Tries to fit the training data well (by keeping the loss small)
* Tries to ==keep the weights small== so that no single weight stands out and causes overfitting

==**L2 Regularization**==
![](https://i.imgur.com/vznJn5m.png)

==**Ridge regression**==
![](https://i.imgur.com/HHAJxoP.png)

---

**==L1 Regularization==**:
![](https://i.imgur.com/rOqJcFw.png)

==**Lasso Regression:**==
![](https://i.imgur.com/qY63U1U.png)

---

**==Elastic Net (a middle ground between Ridge Regression and Lasso Regression)==**

$$
\min_w \frac{1}{m} \sum_{i=1}^{m}{(\hat{y}^{(i)} - y^{(i)})^2} + \alpha r ||w||_1 + 0.5\alpha(1-r)||w||_2^2
$$

Alpha controls how strongly the weights are penalized:
* When alpha is small ---> less regularization, the model may still overfit
* When alpha is large ---> more regularization, risk of underfitting

==**Hyperparameters**==: settings you choose (not learned from the data) to change the model, e.g. r, alpha, learning rate, max_iter, ...

### **K-Nearest Neighbors**

KNN is a **non-parametric** classification algorithm. It simply looks at the K points in the training set that are ==nearest== to the input x and assigns the most common label among them to the output. K is a hyperparameter.

==Parametric models==: ADVANTAGE: faster to predict. DISADVANTAGE: make stronger assumptions about the data distribution.
==Non-parametric models==: ADVANTAGE: more flexible. DISADVANTAGE: computationally expensive.

==**The distance function**==

**Manhattan distance**, or **$L_1$ distance**, or **$L_1$ norm**
$$
dist(x, x') = \sum_{i=1}^{n}{|x_i - x'_i|}
$$

---

**Euclidean distance**, or **$L_2$ distance**, or **$L_2$ norm**
$$
dist(x, x') = \sqrt{\sum_{i=1}^{n}{(x_i - x'_i)^2}}
$$

SMALL K: higher chance of overfitting
LARGE K: underfitting

### **K-fold cross validation**: to find the best set of hyperparameters

![](https://i.imgur.com/KCcCpbm.png)

* Randomly split the training dataset into k folds without replacement, where $k-1$ folds are used for model training and one fold is used for performance evaluation.
* Do that k times to obtain k models and k performance estimates.
* Calculate the average performance of the models (i.e. the final validation score), as shown in the sketch below.
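A minimal sketch of the idea (assuming scikit-learn and a synthetic dataset, neither of which is in the notes): use k-fold cross-validation to compare a few values of the KNN hyperparameter K and keep the one with the best average validation score.

```python
# Hypothetical sketch: k-fold cross-validation to choose K for KNN (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for the training set used in class.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for k in [1, 3, 5, 11, 21]:
    knn = KNeighborsClassifier(n_neighbors=k)
    # cv=5: split the training data into 5 folds; train on 4, validate on 1, repeat 5 times.
    scores = cross_val_score(knn, X, y, cv=5)
    # The final validation score is the average over the 5 folds.
    print(f"K={k:2d}  mean CV accuracy = {scores.mean():.3f}")
```

Very small K usually scores well on the training data but not on the validation folds (overfitting), while very large K underfits, so the best K typically sits in between.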
==**Curse of Dimensionality**==

When the number of features (p) is large but the dataset size (n) stays the same, KNN performs worse.

## Linear SVM & Kernel SVM (Tue, 22/12/20)

The Support Vector Machine (SVM) is a ==**linear classifier**== that finds the maximum-margin separating hyperplane. In other words, SVM finds the separating hyperplane that **==maximizes the distance to the closest data points from both classes==**.

![](https://i.imgur.com/kZeZWgJ.png)

distance from the closest green point to the blue line = margin of the green class

**==FORMULA FOR MARGIN:==**
$$
\Rightarrow \gamma(\mathbf{w},b)=\min_{\mathbf{x}\in D}\frac{\left | \mathbf{w}^T\mathbf{x}+b \right |}{\left \| \mathbf{w} \right \|_{2}}
$$

---> Maximizing the margin (after rescaling $\mathbf{w},b$ so that $\min_{i} |\mathbf{w}^T\mathbf{x}^{(i)}+b| = 1$) is equivalent to:
$$
\begin{align}
&\min_{\mathbf{w},b}\mathbf{w}^T\mathbf{w}&\\
&\textrm{s.t.} \ \ \ \forall i \ y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b) \geq 1 &
\end{align}
$$

==**Hard margin**==: assumes the training data are linearly separable.

**==Support vectors==** are any points that satisfy the condition
$$
y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b) = 1.
$$

---

**==Soft margin==**: allows some points to violate the margin ---> the ==hinge loss== function:
$$
\min_{\mathbf{w},b}\underbrace{\mathbf{w}^T\mathbf{w}}_{\ell_2\text{-regularizer}} + C\ \sum_{i=1}^{m}\underbrace{\max\left [ 1-y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)}+b),0 \right ]}_{\text{hinge loss}}
$$

**==C==**: controls how much points are allowed within the margins.

==Higher C== = stronger penalty: the more points inside the margin, the bigger the hinge-loss term becomes. When C is big enough, no point is allowed within the margins ---> hard margin, and this can lead to overfitting.

==Lower C== = weaker penalty: more points are allowed within the margins.

---

### ==**Kernel method**==

The main idea of kernel methods is to project the data into a high-dimensional space where it is linearly separable, and find a separating hyperplane in that space.

## Tree-based Model, Bagging, and Boosting (Wed, 23/12/20)

### **Decision Tree Classifier**

![](https://i.imgur.com/VuqP1WV.png)

**==Gini Score==**: a measure of impurity (how mixed the classes are within one group); the higher the Gini, the more equally distributed the classes are.
* MAX: when all targets are equally distributed
* MIN: when all observations belong to one class

==Which feature to split on?==
* Select a feature and split the data based on it
* Compute the Gini scores of the 2 groups created by the split
* Compute the (weighted) average of the 2 Gini scores as that feature's Gini value
* The feature with the lowest Gini value is chosen to split the data

---> Use the feature split with the **==SMALLEST==** Gini score

---> A decision tree can overfit/underfit (it can "memorize" the data set)

Decision trees are ==**prone to overfitting**==, especially when a tree is particularly deep. This is due to the **==amount of specificity==**: each additional split leaves a smaller sample of observations that satisfy all the previous conditions, and such small samples can lead to unsound conclusions.

---> Solution: control the depth (max_depth). As the max depth of the decision tree increases, the error on the training set goes down while the error on the test set goes up ---> overfitting (see the sketch below)
* max_features: can also be set as a fraction (e.g. 0.5) of the features considered at each split
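A minimal sketch (scikit-learn and synthetic data assumed, not from the notes) of the max_depth effect described above: training accuracy keeps climbing as the tree gets deeper, while test accuracy stops improving or drops.

```python
# Hypothetical sketch: effect of max_depth on train vs. test accuracy (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# None = grow the tree until the leaves are pure ("memorize" the training data).
for depth in [1, 2, 4, 8, 16, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={str(depth):>4}  "
          f"train acc={tree.score(X_train, y_train):.3f}  "
          f"test acc={tree.score(X_test, y_test):.3f}")
```

Limiting max_depth (or max_features) acts as the regularization knob here, trading a little training accuracy for better test accuracy.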
### **Random Forest Classifier**

**STEPS**:
* **Sample $k$ datasets** $D_1,\dots,D_k$ **==with replacement==** (bootstrap) from $D$
* For each $D_i$, train a Decision Tree, **subsampling the features without replacement** at each split (this increases the variance/diversity of the individual trees)
* The final classifier is $h(\mathbf{x})=\frac{1}{k}\sum_{j=1}^k h_j(\mathbf{x})$

## Unsupervised Learning (Thu, 24/12/20)

(Working with unlabelled data)

### Clustering

#### ==**K-means**==: partitions the data into K groups (clusters)

.fit(X): unsupervised, therefore we only fit on X (no labels y)

### ==**Principal Component Analysis (PCA)**==

PCA reduces the dimensionality of the data set, allowing most of the variability to be explained using fewer variables.

---> PURPOSE: find the directions that maximize variance

## Review & Module Test (Fri, 25/12/20)

### **NAIVE BAYES CLASSIFIER**

#### Conditional probability
$$
P(A|B) = \frac{P(A \cap B)}{P(B)}
$$

**The chain rule** follows from the definition of conditional probability:
$$
P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)
$$

---> **Bayes' rule**:
$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$

$P(A)$: **the prior**
$P(A|B)$: **the posterior**
$P(B|A)$: **the likelihood**

e.g.
$$
P(\text{label} \mid \text{data})
$$

---> ==ASSUMPTION==: ==conditional independence== between every pair of features given the value of the class variable

==**ADVANTAGES**==
* Works well for classification and spam filtering
* Requires only a small amount of training data to estimate the necessary parameters

==**DISADVANTAGES**==
* Prone to underfitting

#### **Multinomial features**

**NAIVE BAYES**
$$
P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)} {P(x_1, \dots, x_n)}
$$

$$
P(\text{"ad"} \mid \text{spam}) = P(x_i \mid y) = \frac{\text{# of times the word "ad" appears in all spam emails}}{\text{# of words in all spam emails combined}}.
$$

# ==**MISCELLANEOUS SHIT**==

np.c_ : concatenates arrays column-wise (adds new columns) in NumPy
PIPELINE: applies its steps in order

# **==TEST==**

1. Linear regression: weights, loss function, gradient descent steps, ...; how to evaluate a model; overfitting, underfitting. **no formulas, no maths**
2. KNN, regularization: bias, variance, trade-off; what can you do with KNN? (regression: yes, take the average of the neighbours); when will KNN overfit/underfit?
3. Naive Bayes: formula, assumption?
4. Linear SVM: what is a support vector? the dotted lines (margins); hard/soft margin with regard to the C value
5. Kernel SVM: project the data into a higher dimension to classify; the kernel trick
6. Decision tree: which score is used to split the tree? how to split the tree; hyperparameters to avoid underfitting/overfitting
7. Bagging, boosting (combine multiple models, then take the average); random forest
8. Unsupervised learning: clustering (k-means, hierarchical)

HOW TO FINE TUNE SHIT (PIPELINE) ---> see the sketch below
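A minimal sketch of fine-tuning with a pipeline (scikit-learn assumed; the step names and the parameter grid are illustrative, not from the notes): the Pipeline applies its steps in order, and GridSearchCV cross-validates every hyperparameter combination.

```python
# Hypothetical sketch: hyperparameter tuning with a Pipeline + GridSearchCV (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The pipeline applies its steps in order: scale the features, then fit the SVM.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC()),
])

# Parameters of a step are addressed as <step name>__<parameter name>.
param_grid = {
    "svm__kernel": ["linear", "rbf"],
    "svm__C": [0.1, 1, 10],  # small C = weak penalty (wider margin), large C = strong penalty
}

search = GridSearchCV(pipe, param_grid, cv=5)  # 5-fold cross-validation for every combination
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Putting the scaler inside the pipeline keeps the cross-validation honest: the scaler is refit on each training fold rather than on the whole dataset.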