# IML: Supervised learning

# Supervised learning

:::info
**Supervised learning:** the process of teaching a model by feeding it input data as well as the correct output data. The model will (hopefully) deduce a correct relationship between the input and the output.
:::

- An input/output pair is called *labeled data*
- All pairs form the *training set*
- Once training is completed, the model can infer new outputs when fed with new inputs

![](https://i.imgur.com/Qldp6N6.png)

Given some training data $\{x_i,y_i\}^n_{i=1}$, supervised learning aims at finding a model $f$ correctly mapping input data $x_i$ to their respective output $y_i$

- The model can then predict new outputs
- The learning mechanism is called *regression* or *classification*

![](https://i.imgur.com/Xfx2W4Y.png)

# Managing data for supervised learning

Hold some data out during training ($\simeq20\%$ of the data) to evaluate model performance afterwards $\Rightarrow$ train/test split

Use a validation set ($\simeq15\%$ of the data) if parameters are iteratively adjusted $\Rightarrow$ train/validation split

![](https://i.imgur.com/dNjoxrs.png)

## Stratified sampling

> For classification purposes

Classes might be imbalanced $\Rightarrow$ use stratified sampling to guarantee a fair balance of train/test samples for each class

![](https://i.imgur.com/LEWGAc9.png)

# Regression

> The art of predicting values

:::info
**Regression**: the output value to predict $y$ is quantitative (a real number)
:::

![](https://i.imgur.com/ASmFdgA.png)

$\Rightarrow$ How to mathematically model the relationship between predictor variables $x_i$ and their numerical output $y_i$?

## Linear regression

Sometimes, there's no need for a complicated model...

![](https://i.imgur.com/KnOVw16.png)

![](https://i.imgur.com/v2HuFrX.png)

## Ordinary Least Squares

![](https://i.imgur.com/o2HCOeb.png)

## Anscombe's quartet

For all 4 datasets $\{(x_1,y_1),(x_2,y_2),...,(x_{11},y_{11})\}$

![](https://i.imgur.com/5GCBu59.png)

The 3rd regression has an *outlier*, i.e. a data point far away from the others that may skew the regression (probably due to a faulty sensor)

$\Rightarrow$ The linear regression line $y=3+0.5x$ and $R^2=0.67$ are the **SAME** for all 4 datasets

## Least absolute deviation

Linear regression by OLS is sensitive to outliers (thank you, $L_2$ norm...)
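A minimal numerical sketch of this sensitivity (assuming NumPy; the data are made up for illustration): corrupting a single observation is enough to pull the OLS slope and intercept away from the true values.

```python
# Sketch: OLS sensitivity to a single outlier (illustrative data, not from the course)
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 3 + 0.5 * x + rng.normal(scale=0.2, size=x.size)  # true model: y = 3 + 0.5x + noise

# OLS fit on the clean data
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Corrupt one observation (a "faulty sensor") and refit
y_corrupt = y.copy()
y_corrupt[-1] += 20.0
slope_out, intercept_out = np.polyfit(x, y_corrupt, deg=1)

print(f"clean  : y = {intercept_clean:.2f} + {slope_clean:.2f} x")
print(f"outlier: y = {intercept_out:.2f} + {slope_out:.2f} x")  # fit dragged toward the outlier
```

The squared residual of the outlier dominates the $L_2$ loss, which is exactly why the $L_1$ loss used by LAD below is more robust.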
![](https://i.imgur.com/GLH1gJh.png)

![](https://i.imgur.com/joBDMuK.png)

*Is it a good idea?*

- $\beta_{LAD}$ is the MLE estimator of $\beta$ when the noise follows a Laplace distribution
- No analytical formula for LAD
    - Harder to find the solution
    - Must use a gradient descent approach
- The solution of LAD **may not be unique**

![](https://i.imgur.com/OnwIuhP.png)

:::warning
All the lines inside the cone are optimal
:::

## Adding some regularization

Add a penalty term to OLS to enforce particular properties on $\hat\beta$

![](https://i.imgur.com/2Kcnvje.png)

## From regression to classification

### Logistic regression

Linear regression predicts a real value $\hat y$ based on predictor variables $x=(x^{(1)},...,x^{(k)})$

- Does not work if $y$ is Boolean
    - $P(y=1)=p$ and $P(y=0)=1-p$
- Use logistic regression instead

![](https://i.imgur.com/lLIaSGd.png)

Linear relationship between the predictor variables and the logit of the event:

![](https://i.imgur.com/JA8tFGp.png)

![](https://i.imgur.com/bsuCgNc.png)

# k-nearest neighbors

The k-NN classifier simply assigns a test data point to the majority class in the neighborhood of that point

- No real training step

![](https://i.imgur.com/lv8noT8.png)

Result:

![](https://i.imgur.com/pBdj6Wt.png)

## Choosing k

- Small k: simple but noisy decision boundary
- Large k: smoother boundaries but computationally intensive
- $k=\sqrt{n}$ can also serve as a starting heuristic, refined by cross-validation
- $k$ should be odd for binary classification

![](https://i.imgur.com/ydcwHXm.png)

## k-nearest neighbors for regression

Use the k nearest neighbors (in terms of features only) and average their outputs to get the predicted value

![](https://i.imgur.com/XDcEhsh.png)

![](https://i.imgur.com/ETtZj4O.png)

# Support Vector Machine

## Linear SVM

Training set:
$$\{x_i,y_i\}_{i=1}^n$$
with $x_i\in\mathbb R^p$ and $y_i\in\{-1,+1\}$

Goal: find the hyperplane that best divides the *positive* samples from the *negative* samples

![](https://i.imgur.com/Lw1a9SK.png)

*What do we feel like doing here?* Taking an average

![](https://i.imgur.com/DxOBlOY.png)

:::success
We are looking for the line that *passes as much in the middle as possible* (maximum margin)

![](https://i.imgur.com/yLBeNxo.png)
:::

Reminder: dot product of 2 collinear vectors:
$$
<\vec w, \vec{AB}> = \Vert \vec w\Vert\cdot\Vert \vec{AB}\Vert
$$

![](https://i.imgur.com/Bg7DrvS.png)

## Soft margin SVM

Data may not be fully linearly separable

## Kernel SVM

> Remember the kernel trick?

Kernel trick:

- Map data points into a high-dimensional space where they would become linearly separable
- Effortlessly interfaced with the SVM by replacing the dot product $<.,.>$ with its kernelized version $k(.,.)$

![](https://i.imgur.com/Eb8S2SB.png)

Widely used kernel functions:

- Polynomial kernel
![](https://i.imgur.com/jWSsnBZ.png)
- Gaussian RBF kernel
![](https://i.imgur.com/qFneRGj.png)
- Sigmoid kernel
![](https://i.imgur.com/0j7NqQr.png)

## Choosing the right kernel with the right hyperparameters

Kernel $\Rightarrow$ Try linear first. If it does not work, RBF is probably the best kernel choice (unless you have some prior information on the geometry of your dataset)

Hyperparameters ($C$ + kernel parameter(s)) $\Rightarrow$ grid search and cross-validation, as sketched below
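One possible shape for that tuning step, sketched with scikit-learn (the dataset, grid values and fold count are illustrative assumptions, not prescribed by the course):

```python
# Sketch: grid search over C and gamma for an RBF-kernel SVM, scored by cross-validation
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out ~20% of the data for the final evaluation (stratified train/test split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # 5-fold CV per combination
search.fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```

Note that the test set is touched only once, after the grid search has picked its hyperparameters on the training folds.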
# Multiclass SVM

## What if we have more than 2 classes?

Two possible strategies:

**One versus all:** one SVM model *per class* $\to$ separate the class from all other classes

- Assign new points with the *winner takes all* rule
- If there is no outright winner, assign the point to the class of the closest hyperplane (Platt scaling)

**One versus one:** one SVM model *per pair of classes* $\to$ separate 2 classes at a time, ignoring the other data

- Assign new points with the *majority voting* rule

![](https://i.imgur.com/nRtugaz.png)

# Decision trees

:::info
Decision trees use recursive partitioning to create a sequence of decision rules on the input features that form a nested split of the data points
:::

Input features can be numeric (decision $\le$) or categorical (decision $==$)

Decision node $=$ decision rule for one feature

Classification tree $\to$ predict a class
Regression tree $\to$ predict a real number

![](https://i.imgur.com/boFTfRm.png)

On the current node, try all the possible decision rules for all features and select the decision that best splits the data

Classification tree $\to$ impurity criterion
Regression tree $\to$ variance reduction

![](https://i.imgur.com/QV9u5WY.png)

Final decision boundaries $\equiv$ overlapping orthogonal half-planes

Decision on new data $\to$ run it down through the branches and assign the class of the leaf it reaches

## How to split a node

Which split should we choose between

![](https://i.imgur.com/upBo3VQ.png)

![](https://i.imgur.com/GBwhhZY.png)

> The answer is the left one

![](https://i.imgur.com/E0oAirn.png)

![](https://i.imgur.com/REyBYVW.png)

:::success
Stop recursive partitioning if the node is pure
:::

## Pros and cons of decision trees

### Pros

- Simple decision rules
- Surprisingly computationally efficient
- Handle multiclass problems
- Handle numeric and categorical features at the same time

### Cons

- Strongly overfit the data
- Bad predictive accuracy

![](https://i.imgur.com/NWFghGJ.png)

:::success
**Potential solution**

Restrain the growth of the tree by imposing a maximal tree depth
:::

## Random forests

> Bagging several decision trees

Decision trees are *weak* classifiers when considered individually

- Average the decisions of several of them
- Compensate their respective errors (*wisdom of crowds*)
- Useless if all decision trees see the same data
    - Introduce some variability with *bagging* (bootstrap aggregating)
    - Introduce more variability by selecting only $p$ out of the $m$ total features for each split in each decision tree (typically $p=\sqrt{m}$)

![](https://i.imgur.com/sypxX62.png)

The final decision is taken by majority voting on all decision tree outputs

![](https://i.imgur.com/ToaBCwg.png)

# Decision boundaries comparison

![](https://i.imgur.com/iizKHKj.png)

## Evaluating regression/classification performances

![](https://i.imgur.com/fvXdKV8.png)

# Cross-validation

$k$-fold cross-validation:

- Divide the whole dataset into $k$ non-overlapping sample blocks
- Train $k$ models, each on $(k-1)$ training blocks, and test on the remaining block
- Compute the performance metrics of each model + the average & standard deviation over all $k$ models

![](https://i.imgur.com/nrYJlCm.png)

# Confusion matrix

![](https://i.imgur.com/KHqLVmX.png)
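To make these last two evaluation tools concrete, here is a short sketch assuming scikit-learn (the dataset, classifier and fold count are illustrative choices): 5-fold cross-validation of a random forest, followed by a confusion matrix on a held-out test block.

```python
# Sketch: k-fold cross-validation + confusion matrix (illustrative setup)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Random forest with p = sqrt(m) features considered at each split
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

# k-fold cross-validation: report the average and standard deviation of the k scores
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Confusion matrix of the fitted model on the held-out test block
clf.fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))
```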