# ML Cheatsheet

###### tags: `INTERVIEW PREP`

## Machine Learning Basics

### Underfitting/Overfitting

- Overfitting: the model fits the training data so closely that it essentially memorises it and cannot generalize to unseen data. As overfitting gets worse, the validation loss increases. High variance.
  - Perform regularization (lasso, ridge)
  - Get more data
<!-- 1. Cross validation to detect overfitting. 2. Train with more data. 3. Data augmentation. 4. Feature selection. 5. Early stopping. 6. Regularization. 7. Ensemble methods. -->
- Underfitting: the model did not learn enough from the data. High bias.
  - Use a more complex model
  - Add more features
  - Train longer

### Bias/Variance Tradeoff

- Bias: the difference between the model's expected prediction and the correct values we try to predict for given data points.
- Variance: the variability of the model's predictions for given data points.
- **The simpler the model, the higher the bias; the more complex the model, the higher the variance. Too much of either means the model cannot both predict accurately on seen data and generalize to unseen data.**

### Generative & Discriminative Models

- Generative: estimate P(x|y)P(y) and then deduce P(y|x); learns the distribution of the data.
- Discriminative: directly estimate P(y|x); learns the decision boundary.

### Case

- Given a set of ground truths and two models, how can you be confident that one model is better than the other? Perform k-fold cross validation and see which model generalizes best.

## Classification Algorithms

### Support Vector Machine

The SVM should maximize the margin between the two classes. Mathematically, this means maximizing the distance between the hyperplane defined by $\mathbf{w}^T\mathbf{x}+b=-1$ and the hyperplane defined by $\mathbf{w}^T\mathbf{x}+b=1$, which gives a quadratic optimization problem. This is the hard margin. We can then add slack variables with a C parameter as the error term (soft margin). Kernel tricks can be used to implicitly map the data into a higher-dimensional space where it becomes easier to separate, without computing the mapping explicitly.

### Naive Bayes Classifier

We want the probability that a new example belongs to each class, $P(class \mid feature_1, feature_2, ..., feature_n)$, compute that probability for every class, and pick the most likely class. The problem is that we usually don't have those probabilities directly, so we apply Bayes' rule:

$$ P(class \mid features) = \frac{P(features \mid class)\cdot P(class)}{P(features)} $$

$$ P(class \mid features) \propto P(features \mid class)\cdot P(class) $$

Even if the features are not actually independent, we assume they are (that's the "naive" part of naive Bayes).

## Regularization

- L1 (Lasso) regularization: shrinks less important feature coefficients to exactly 0, hence good for variable selection.
- L2 (Ridge) regularization: makes coefficients smaller but keeps them non-zero.

Lasso can be thought of as Bayesian regression with a Laplacian prior (L1 as the regularization term in the optimization cost function), and Ridge as Bayesian regression with a Gaussian prior (L2 as the regularization term). The intuitive difference between L1 and L2: L1 tries to estimate the median of the data while L2 tries to estimate the mean. A minimal code sketch contrasting the two follows this section.

- Derivation
- Sparsity
- Why are L3, L4 not required? L1 and L2 cover the majority of use cases.
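To make the L1-vs-L2 contrast concrete, here is a minimal sketch assuming scikit-learn; the synthetic dataset and alpha values are purely illustrative:

```python
# Minimal sketch: L1 (Lasso) vs L2 (Ridge) on synthetic data.
# Dataset and regularization strengths are illustrative, not tuned.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 100 features, but only 10 are actually informative
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 drives most irrelevant coefficients exactly to zero (sparse solution),
# while L2 merely shrinks all coefficients toward zero.
print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))
print("Ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))
```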
## Metrics

- Precision
Precision = TP/(TP+FP). Precision is used when there is a high cost associated with false positives. ==Example==: if we classify YouTube videos as safe for children to watch, we want precision to be high. A high false positive rate, i.e. misclassifying violent videos as child-friendly, would be a complete disaster.
- Recall
Recall = TP/(TP+FN). Recall is used when there is a high cost associated with false negatives. ==Example==: if we classify whether a patient has cancer, we want recall to be high, because we do not want patients who actually have cancer to be classified as not having it. A false positive, i.e. mistakenly telling someone they may have cancer, is comparatively more benign.
- Tradeoff
Increasing precision will generally reduce recall, and vice versa.
- F1
The F1 score is the harmonic mean of precision and recall. It is needed when you want a balance between precision and recall. Compared with accuracy, it is more informative when the classes are imbalanced.

Micro vs Macro: a macro-average computes the metric independently for each class and then takes the average (hence treating all classes equally), whereas a micro-average aggregates the contributions of all classes to compute the average metric. In a multi-class classification setup, micro-average is preferable if you suspect there might be class imbalance.

**Example Use Cases**
- Micro F1: image classification with many classes, where each image is equally important (e.g., object detection).
- Macro F1: evaluating minority classes equally (e.g., fraud detection across multiple fraud types).
- Weighted F1: text classification where some categories are much more common than others, like sentiment analysis across different regions.

- Confusion Matrix
A table layout that allows visualization of the performance of an algorithm. Each row represents the instances in a predicted class while each column represents the instances in an actual class, or vice versa:

| | Actual positive | Actual negative |
|---|---|---|
| Predicted positive | TP | FP |
| Predicted negative | FN | TN |

- ROC
An ROC curve is a graph showing the performance of a classification model at different thresholds. The curve plots two parameters: true positive rate and false positive rate.
True positive rate = recall = TP/(TP+FN)
False positive rate = FP/(FP+TN)
Lowering the threshold allows more items to be classified as positive, thus increasing both the true positive rate and the false positive rate.
- AUC (Area under the ROC Curve)
AUC is ==scale-invariant==: it measures how well predictions are ranked rather than their absolute values. AUC is also classification-threshold-invariant: it measures the quality of the model's predictions irrespective of what threshold is chosen. AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
- Log Loss
![](https://i.imgur.com/tqJYLSK.png)

#### Summarization of Metrics

Here's a guide on when to use precision, recall, accuracy, and F1 score, with examples (a short code sketch computing these metrics follows the list):

1. Accuracy
What it is: the proportion of correct predictions (both true positives and true negatives) out of all predictions.
When to use it: when the classes are well-balanced, i.e. there is no significant class imbalance.
Example: if you're building a model to predict whether an email is spam or not on a balanced dataset, accuracy is a suitable metric. If 50% of emails are spam and 50% are not, accuracy gives a good idea of how well your model is doing.
2. Precision
What it is: the proportion of true positives out of all positive predictions (true positives + **false positives**).
When to use it: when false positives are more costly than false negatives, i.e., when you care more about the accuracy of positive predictions.
Example: in fraud detection, precision is crucial. You want to minimize the cases where you falsely flag a legitimate transaction as fraud. High precision ensures that flagged transactions are likely to actually be fraudulent, saving time and reducing inconvenience.
3. Recall
What it is: the proportion of true positives out of all actual positives (true positives + **false negatives**).
When to use it: when false negatives are more costly than false positives, i.e., when it's essential to capture all positive cases.
Example: in a medical diagnosis setting for a serious condition (e.g., cancer detection), recall is essential. It's more critical to ensure that all cases of cancer are detected, even if it means having some false positives, as missing a positive case could be life-threatening.
4. F1 Score
What it is: the harmonic mean of precision and recall, balancing the two.
When to use it: when you need a balance between precision and recall, especially in cases of imbalanced classes.
Example: in spam detection with an imbalanced dataset (where most emails are not spam), the F1 score combines precision and recall, which helps if you want to avoid mislabeling legitimate emails while also minimizing missed spam.
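A minimal sketch of computing these metrics, assuming scikit-learn; the labels and scores below are made up for illustration:

```python
# Minimal sketch: computing the metrics above with scikit-learn.
# Ground-truth labels, hard predictions, and scores are illustrative only.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             accuracy_score, confusion_matrix, roc_auc_score)

y_true   = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground truth
y_pred   = [1, 0, 1, 0, 0, 1, 1, 0]                   # predictions at some threshold
y_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]   # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
print("Macro F1 :", f1_score(y_true, y_pred, average="macro"))
print("AUC      :", roc_auc_score(y_true, y_scores))  # threshold-invariant, uses scores
print(confusion_matrix(y_true, y_pred))               # rows = actual, columns = predicted
```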
## Loss and Optimization

- Loss function vs. cost function
A loss function describes the error for a single data point against its gold label. A cost function is more general: it is usually the sum (or mean) of the loss over the training set, plus possibly a model complexity penalty (regularization).
- Why doesn't MSE work with logistic regression?
When the MSE loss is plotted with respect to the weights of a logistic regression model, the resulting surface is not convex, which makes it difficult to find the global minimum. The non-convexity is caused by the non-linearity introduced by the sigmoid function. Using the MLE-derived (cross-entropy) cost function instead gives a convex problem.
- Mean Squared Error
Commonly used for regression problems: $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.
- OLS
The OLS method aims to minimize the sum of squared differences between the observed and predicted values.
- Relative Entropy
Relative entropy is also called KL divergence. It is a measure of how one probability distribution differs from a second, reference probability distribution.
$$ KL(P \| Q) = \sum_{x} P(x)\log {\frac{P(x)}{Q(x)}} $$
- Cross Entropy
$$ H(P,Q) = -\sum_x P(x)\log Q(x)$$
From the equations we can see that the KL divergence decomposes into the cross-entropy of P and Q minus the entropy of the ground-truth distribution P: $KL(P \| Q) = H(P,Q) - H(P)$.
- SVM loss function: hinge loss
![](https://i.imgur.com/taAHkRx.png)

### Linear Regression

- What if the variables aren't independent?
The resulting instability of the model may cause overfitting: if you apply the model to another sample of data, the accuracy will drop significantly compared to the accuracy on your training dataset. The remedy is variable selection.
- MSE vs MLE
In a linear model, if the errors belong to a normal distribution, the least squares estimators are also the maximum likelihood estimators.

### Logistic Regression

![](https://i.imgur.com/Ek2Hz6U.png)
![](https://i.imgur.com/BsF30Ta.png)

### PCA Analysis

In a situation where you have a whole bunch of independent variables, PCA helps you figure out which linear combinations of these variables matter the most. It summarizes the features into fewer components that can approximately recreate the original features while preserving as much of the original information in the data as possible. The steps are listed next, with a code sketch after the list.

1. Compute the covariance matrix of the whole dataset.
2. Compute the eigenvectors and the corresponding eigenvalues.
3. Sort the eigenvectors by decreasing eigenvalues and choose the k eigenvectors with the largest eigenvalues to form a d × k dimensional matrix W.
4. Use this d × k eigenvector matrix to transform the samples onto the new subspace.
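A minimal NumPy sketch of these steps; the data matrix X and the choice of k are placeholders for illustration (in practice a library implementation such as `sklearn.decomposition.PCA` would normally be used):

```python
# Minimal sketch of the PCA steps above with NumPy; X and k are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))     # 200 samples, d = 5 features
k = 2                             # number of components to keep

# 1. Covariance matrix of the (mean-centered) dataset
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)          # d x d

# 2. Eigenvectors and eigenvalues (eigh: the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort by decreasing eigenvalue and keep the top-k eigenvectors -> d x k matrix W
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:k]]

# 4. Project the samples onto the new k-dimensional subspace
X_reduced = X_centered @ W                      # n x k
print(X_reduced.shape)                          # (200, 2)
```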
### Kmeans Clustering

- After randomly initializing the cluster centroids
$$ \mu_1,\mu_2,...,\mu_k\in\mathbb{R}^n $$
the k-means algorithm repeats the following steps until convergence: initialize -> assign each point to the nearest centroid -> update each centroid to the mean of its assigned points -> check for convergence.

### KNN Classifier

1. Pick the k nearest neighbors.
2. Take the majority vote among those k neighbors.

## Cross Validation

KFold is a cross-validator that divides the dataset into k folds. Stratified K-fold additionally ensures that each fold has the same proportion of observations with a given label as the full dataset.

## Ensemble Methods

Ensemble methods combine several tree-based (or other weak) base algorithms to achieve better predictive performance than a single base algorithm. The main principle behind ensemble models is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model. A small sketch contrasting bagging and boosting follows at the end of this section.

### Boosting

Trains learners sequentially, each one trying to correct the errors of the previous ones, and combines their outputs to classify.
![](https://i.imgur.com/jXNrYcG.png)

### Bagging

Trains learners in parallel on bootstrap samples and averages (or votes over) their predictions; bagging mainly reduces the model's variance.
![](https://i.imgur.com/XW3NndP.png)
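A minimal sketch contrasting the two, assuming scikit-learn and its built-in breast cancer dataset; the hyperparameters are illustrative, not tuned:

```python
# Minimal sketch: bagging (parallel, variance reduction) vs boosting (sequential).
# Uses scikit-learn's built-in breast cancer dataset; hyperparameters are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many trees fit in parallel on bootstrap samples, predictions aggregated
# (the default base estimator is a decision tree).
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Boosting: shallow trees fit sequentially, each correcting the current ensemble's errors.
boosting = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")  # 5-fold cross validation
    print(name, scores.mean())
```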