# IML: Supervised learning
# Supervised learning
:::info
**Supervised learning:** process of teaching a model by feeding it input data as well as correct output data. The model will (hopefully) deduce a correct relationship between the input and output
:::
- An input/output pair is called *labeled data*
- All pairs form the *training set*
- Once training is completed, the model can infer new outputs if fed with new inputs.

Given some training data $\{x_i,y_i\}^n_{i=1}$, supervised learning aims to find a model $f$ that correctly maps input data $x_i$ to their respective outputs $y_i$
- The model can predict new outputs
- The learning task is called *regression* or *classification*, depending on whether the output is quantitative or categorical

# Managing data for supervised learning
Hold out some data during training ($\simeq20\%$ of the data) to evaluate model performance afterwards $\Rightarrow$ train/test split
Use a validation set ($\simeq15\%$ of the data) if parameters are iteratively adjusted $\Rightarrow$ train/validation split
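A minimal scikit-learn sketch of these two splits (the toy `X`/`y` data and the exact ratios are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for the real features X and labels y
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out ~20% of the data as the final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Carve a validation set (~15% of the full data) out of the remaining 80%
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15 / 0.8, random_state=0
)
```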

## Stratified sampling
> For classification purposes
Classes might be imbalanced $\Rightarrow$ use stratified sampling to guarantee a fair balance of train/test samples for each class
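With scikit-learn, stratification is a single argument to the splitter (a sketch on imbalanced toy data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced toy data: ~90% of samples in class 0, ~10% in class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# stratify=y preserves the 90/10 class ratio in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```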

# Regression
> The art of predicting values
:::info
**Regression**: the output value to predict $y$ is quantitative (real number)
:::

$\Rightarrow$ How to mathematically model the relationship between predictor variables $x_i$ and their numerical output $y_i$ ?
## Linear regression
Sometimes, there's no need for a complicated model...
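In its basic form, the model is simply a linear (affine) relationship between the predictors and the output, plus noise:
$$
y_i = \beta_0 + \beta_1 x_i^{(1)} + \dots + \beta_k x_i^{(k)} + \varepsilon_i
$$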


## Ordinary Least Squares
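OLS estimates the coefficients by minimizing the sum of squared residuals; in matrix notation,
$$
\hat\beta_{OLS} = \arg\min_\beta \Vert y - X\beta\Vert_2^2 = (X^\top X)^{-1}X^\top y
$$
(the closed-form expression holds when $X^\top X$ is invertible).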

## Anscombe's quartet
For all 4 datasets $\{(x_1,y_1),(x_2,y_2),...,(x_{11},y_{11})\}$

The 3rd regression has an *outlier*, i.e. a data point far away from the others that risks skewing the regression (probably due to a faulty sensor)
$\Rightarrow$ Linear regression line $y=3+0.5x$ and $R^2=0.67$ are the **SAME** for all 4 datasets
## Least absolute deviation
Linear regression by OLS is sensitive to outliers (thank you $L_2$ norm...)
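LAD replaces the $L_2$ norm of the residuals with the $L_1$ norm:
$$
\hat\beta_{LAD} = \arg\min_\beta \Vert y - X\beta\Vert_1 = \arg\min_\beta \sum_{i=1}^n \vert y_i - x_i^\top\beta\vert
$$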


*Is it a good idea ?*
- $\hat\beta_{LAD}$ is the maximum likelihood estimator of $\beta$ when the noise follows a Laplace distribution
- No analytical formula for LAD
- Harder to find the solution
- Must use a gradient descent approach
- Solution of LAD **may not be unique**

:::warning
All the lines within the cone are optimal
:::
## Adding some regularization
Add a penalty term to OLS to enforce particular properties of $\hat\beta$
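Two classical examples of such penalties (given here as standard illustrations) are ridge ($L_2$ penalty, shrinks coefficients) and lasso ($L_1$ penalty, promotes sparsity):
$$
\hat\beta_{ridge} = \arg\min_\beta \Vert y - X\beta\Vert_2^2 + \lambda\Vert\beta\Vert_2^2
\qquad
\hat\beta_{lasso} = \arg\min_\beta \Vert y - X\beta\Vert_2^2 + \lambda\Vert\beta\Vert_1
$$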

## From regression to classification
### Logistic regression
Linear regression predicts a real value $\hat y$ based on predictor variables $x=(x^{(1)},\dots,x^{(k)})$
- Does not work if $y$ is boolean
- $P(y=1)=p$ and $P(y=0)=1-p$
- Use logistic regression instead

Linear relationship between predictor variables and logit of event:
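Writing $p = P(y=1\mid x)$:
$$
\operatorname{logit}(p) = \log\frac{p}{1-p} = \beta_0 + \beta_1 x^{(1)} + \dots + \beta_k x^{(k)}
$$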


# k-nearest neighbors
k-NN classifier simply assigns test data points to the majority class in the neighborhood of the test points
- no real training step
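A minimal scikit-learn sketch of the classifier described above (iris data and $k=5$ are arbitrary stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Training" only stores the data; each prediction is a majority vote
# among the 5 nearest training points of the test point
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))
```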

## Choosing k
- small k: simple but noisy decision boundary
- large k: smoothed boundaries but computationally intensive
- $k=\sqrt{n}$ can also serve as a starting heuristic, refined by cross-validation (see the sketch after this list)
- $k$ should be odd for binary classification
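A sketch of picking $k$ by cross-validation over a small grid of odd values (iris data as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Cross-validate over a small grid of odd k values
search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5
)
search.fit(X, y)
print(search.best_params_)
```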

## k-nearest neighbors for regression
Use the k nearest neighbors (in terms of features only) and average to get predicted value
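A minimal k-NN regression sketch on toy 1-D data (the sine target and $k=5$ are arbitrary choices):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D regression problem: y = sin(x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

# Prediction = average of the targets of the 5 nearest training points
reg = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(reg.predict([[1.0], [3.0]]))
```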


# Support Vector Machine
## Linear SVM
Training set: $\{x_i,y_i\}_{i=1}^n$ with $x_i\in\mathbb R^p$ and $y_i\in\{-1,+1\}$
Goal: find the hyperplane that best divides *positive* samples and *negative* samples

*What do we feel like doing here?*
Taking an average

:::success
We are looking for the line that *passes closest to the middle* between the two classes

:::
Reminder: dot product of 2 collinear vectors:
$$
\langle \vec w, \vec{AB}\rangle = \Vert \vec w\Vert \cdot \Vert \vec{AB}\Vert
$$
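Projecting samples onto $\vec w$ with this dot product is what measures the margin; in its standard (hard-margin) form, the problem solved by the linear SVM is
$$
\min_{w,b}\ \frac{1}{2}\Vert w\Vert^2
\quad\text{s.t.}\quad y_i\,(\langle w, x_i\rangle + b) \ge 1,\quad i=1,\dots,n
$$
which maximizes the margin $2/\Vert w\Vert$ between the two classes.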

## Soft margin SVM
Data may not be fully linearly separable
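The standard soft-margin formulation adds slack variables $\xi_i$ and a cost hyperparameter $C$ that trades margin width against violations:
$$
\min_{w,b,\xi}\ \frac{1}{2}\Vert w\Vert^2 + C\sum_{i=1}^n \xi_i
\quad\text{s.t.}\quad y_i\,(\langle w, x_i\rangle + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0
$$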
## Kernel SVM
> Remember the kernel trick ?
Kernel trick:
- Map data points into a high-dimensional space where they would become linearly separable
- Effortlessly interfaced with the SVM by replacing the dot product $\langle\cdot,\cdot\rangle$ by its kernelized version $k(\cdot,\cdot)$

Widely used kernel functions:
- Polynomial kernel 
- Gaussian RBF kernel 
- Sigmoid kernel 
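Their usual parameterizations (using the $\gamma$, $r$ and $d$ notation exposed by common SVM libraries) are:
$$
k_{poly}(x,x') = (\gamma\langle x,x'\rangle + r)^d,
\qquad
k_{RBF}(x,x') = \exp\left(-\gamma\Vert x - x'\Vert^2\right),
\qquad
k_{sig}(x,x') = \tanh(\gamma\langle x,x'\rangle + r)
$$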
## Choosing the right kernel with the right hyperparameters
Kernel $\Rightarrow$ Try linear first. If it does not work, RBF is probably the best kernel choice (unless you have some prior information on the geometry of your dataset)
Hyperparameters ($C$ + kernel parameter(s)) $\Rightarrow$ grid search and cross-validation
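A minimal grid-search sketch for an RBF SVM (the grid values and the iris data are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Joint grid over the soft-margin cost C and the RBF kernel width gamma
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```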
# Multiclass SVM
## What if we have more than 2 classes ?
2 possible strategies
**One versus all:** one SVM model *per class* $\to$ separate the class from all other classes
- Assign new points with *winner takes all* rule
- if no outright winner, assign point to the class of closest hyperplane (Platt scaling)
**One versus one**: one SVM model *per pair of classes* $\to$ separate 2 classes at a time, ignoring the other data
- assign new points with *majority voting* rule
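One way to make both strategies explicit is with scikit-learn's wrappers (a linear-kernel SVC as the underlying binary classifier is an arbitrary choice; `SVC` already applies one-versus-one internally for multiclass problems):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

# One versus all: one binary SVM per class, winner-takes-all on decision values
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# One versus one: one binary SVM per pair of classes, majority voting
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print(ova.predict(X[:3]), ovo.predict(X[:3]))
```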

# Decision trees
:::info
Decision trees use recursive partitioning to create a sequence of decision rules on input features that produces nested splits of the data points
:::
Input features can be numeric (decision $\le$) or categorical (decision $==$)
Decision node $=$ decision rule for one feature
Classification tree $\to$ predict class
Regression tree $\to$ predict real number

On the current node, try to apply all the possible decision rules for all features and select the decision that best splits the data
Classification tree $\to$ impurity criterion
Regression tree $\to$ variance reduction
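One common impurity criterion for classification trees (entropy is another option) is the Gini index of a node,
$$
G = \sum_{c} p_c\,(1 - p_c)
$$
where $p_c$ is the proportion of class $c$ in the node; the retained split is the one that most reduces the weighted impurity of the child nodes.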

Final decision boundaries $\equiv$ overlapping orthogonal half planes
Decision on new data $\to$ run it down through the branches and assign a class
## How to split a node
Which split should we choose between the two candidates?

> The answer is the left one


:::success
Stop recursive partitioning if the node is pure
:::
## Pros and cons of decision trees
### Pros
- Simple decision rules
- Surprisingly computationally efficient
- Handle multiclass problems
- Handle numeric and categorical features at the same time
### Cons
- Strongly overfit data
- Bad predictive accuracy

:::success
**Potential solution**
Restrain the growth of the tree by imposing a maximal tree depth
:::
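A minimal sketch of this restriction with scikit-learn (the depth value and the iris data are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth=3 restrains the growth of the tree and limits overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.get_depth(), tree.score(X, y))
```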
## Random forests
> Bagging several decision trees
Decision trees are *weak* classifiers when considered individually
- Average the decision of several of them
- Compensate their respective errors (*wisdom of crowds*)
- Useless if all decision trees see the same data
- Introduce some variability with *bagging* (bootstrap aggregating)
- Introduce more variability by selecting only $p$ out of $m$ total features for each split in each decision tree (typically $p=\sqrt{m}$)

Final decision is taken by majority voting on all decision tree outputs
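A minimal random forest sketch (100 trees and the iris data are placeholders; `max_features="sqrt"` implements the $p=\sqrt{m}$ rule):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 bagged trees; each split considers only sqrt(m) randomly chosen features
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
).fit(X, y)
print(forest.predict(X[:5]))  # majority vote over the 100 trees
```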

# Decision boundaries comparison

## Evaluating regression/classification performances

# Cross-validation
$k$-fold cross validation
- Divide whole data into $k$ non-overlapping sample blocks
- Train $k$ models on $(k-1)$ training blocks and test on remaining block
- Compute performance metrics of each model + average & standard deviation over all $k$ models
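A minimal $k$-fold sketch with scikit-learn (5 folds, an RBF SVM and iris data as stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold CV: 5 models, each trained on 4 blocks and tested on the remaining one
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)
print(scores.mean(), scores.std())
```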

# Confusion matrix
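A minimal sketch of computing the confusion matrix of a fitted classifier (logistic regression on iris as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Rows = true classes, columns = predicted classes
print(confusion_matrix(y_test, clf.predict(X_test)))
```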
