
IML: Supervised learning

Supervised learning

Supervised learning: process of teaching a model by feeding it input data as well as correct output data. The model will (hopefully) deduce a correct relationship between the input and output

  • An input/output pair is called labeled data
    • All pairs form the training set
  • Once training is complete, the model can infer outputs for new inputs.


Given some training data $\{x_i, y_i\}_{i=1}^n$, supervised learning aims at finding a model $f$ correctly mapping input data $x_i$ to their respective outputs $y_i$

  • The model can predict new outputs
  • The learning mechanism is called regression or classification


Managing data for supervised learning

Hold some data out during training ($\approx 20\%$ of the data) to evaluate model performance afterwards → train/test split

Use a validation set ($\approx 15\%$ of the data) if parameters are iteratively adjusted → train/validation split
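
A minimal scikit-learn sketch of this two-stage split (the toy data and the exact proportions are placeholders, not course-prescribed values):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset.
X = np.random.rand(100, 3)            # 100 samples, 3 features
y = np.random.randint(0, 2, size=100) # binary labels

# Hold out 20% of the data as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Carve a validation set out of the remaining 80%
# (0.1875 * 80% = 15% of the full dataset).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.1875, random_state=0)
```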


Stratified sampling

For classification purposes, classes might be imbalanced → use stratified sampling to guarantee a fair balance of train/test samples for each class
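
A sketch of stratified splitting with scikit-learn (the imbalanced toy dataset below is made up for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced toy dataset: roughly 90% of samples in class 0.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# stratify=y keeps the class proportions identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```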


Regression

The art of predicting values

Regression: the output value to predict $y$ is quantitative (a real number)


How to mathematically model the relationship between predictor variables $x_i$ and their numerical outputs $y_i$?

Linear regression

Sometimes, there's no need for a complicated model


Ordinary Least Squares

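OLS fits $\hat{\beta} = \arg\min_\beta \|y - X\beta\|_2^2$, which has the closed-form solution $\hat{\beta} = (X^\top X)^{-1} X^\top y$. A minimal numpy sketch (toy 1-D data assumed):

```python
import numpy as np

# Toy regression data: y ≈ 3 + 0.5 x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 + 0.5 * x + rng.normal(scale=1.0, size=50)

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Closed-form OLS solution; lstsq is the numerically
# stable way to compute (X^T X)^-1 X^T y.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # ≈ [3.0, 0.5]
```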

Anscombe's quartet

For all 4 datasets $\{(x_1,y_1),(x_2,y_2),\dots,(x_{11},y_{11})\}$:


The 3rd regression has an outlier, i.e. a data point very far from the others, which risks skewing the regression (probably due to a faulty sensor).

Linear regression line $y = 3 + 0.5x$ and $R^2 = 0.67$ are the SAME for all 4 datasets

Least absolute deviation

Linear regression by OLS is sensitive to outliers (thank you, $L_2$ norm)


Is it a good idea?

  • $\hat{\beta}_{LAD}$ is the MLE estimator of $\beta$ when the noise follows a Laplace distribution
  • No analytical formula for LAD
    • Harder to find the solution
    • Must use a gradient descent approach (numerical sketch below)
  • The LAD solution may not be unique


All the lines inside the cone are optimal.
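
A minimal sketch of LAD fitting by numerically minimizing the $L_1$ loss (here with a derivative-free optimizer rather than gradient descent, since the loss is not differentiable everywhere; the data and outliers are made up):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 + 0.5 * x + rng.normal(scale=1.0, size=50)
y[:3] += 30  # a few outliers that would pull OLS away

X = np.column_stack([np.ones_like(x), x])

# LAD objective: sum of absolute residuals (no closed form).
def l1_loss(beta):
    return np.abs(y - X @ beta).sum()

# Nelder-Mead handles the non-smooth objective without gradients.
res = minimize(l1_loss, x0=np.zeros(2), method="Nelder-Mead")
print(res.x)  # close to [3.0, 0.5] despite the outliers
```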

Adding some regularization

Add a penalty term to OLS to enforce particular properties on $\hat{\beta}$
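
For instance, an $L_2$ penalty gives ridge regression and an $L_1$ penalty gives the lasso. A minimal scikit-learn sketch (toy data, arbitrary penalty weight):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# L2 penalty (ridge): shrinks coefficients toward zero.
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 penalty (lasso): drives some coefficients exactly to zero (sparsity).
lasso = Lasso(alpha=1.0).fit(X, y)

print(ridge.coef_)
print(lasso.coef_)
```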


From regression to classification

Logistic regression

Linear regression predicts a real value $\hat{y}$ based on predictor variables $x = (x^{(1)}, \dots, x^{(k)})$

  • Does not work if $y$ is boolean
  • $P(y=1) = p$ and $P(y=0) = 1 - p$
  • Use logistic regression instead


Linear relationship between predictor variables and logit of event:

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x^{(1)} + \dots + \beta_k x^{(k)}$$

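
A minimal scikit-learn sketch (toy data assumed):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict(X_test[:5]))        # hard 0/1 predictions
print(clf.predict_proba(X_test[:5]))  # estimated P(y=0), P(y=1)
```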

k-nearest neighbors

k-NN classifier simply assigns a test data point to the majority class in the neighborhood of that point

  • no real training step
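
A minimal sketch; note that `fit` essentially just stores the training points:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# "Training" only stores the data; the work happens at prediction time.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))
```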



Choosing k

  • small k: simple but noisy decision boundary
  • large k: smoother boundaries but computationally intensive
  • $k = \sqrt{n}$ can also serve as a starting heuristic, refined by cross-validation (sketched after this list)
  • $k$ should be odd for binary classification
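
A sketch of choosing k by grid search with cross-validation (odd candidate values; the toy data is a placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Odd values of k only (binary classification, avoids ties).
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11, 13]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```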


k-nearest neighbors for regression

Use the k nearest neighbors (in terms of features only) and average their output values to get the predicted value
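
A minimal sketch (toy 1-D data assumed):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# Prediction = average of the targets of the 5 nearest neighbors.
reg = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(reg.predict([[2.5]]))
```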


Support Vector Machine

Linear SVM

Training set: $\{x_i, y_i\}_{i=1}^n$ with $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, +1\}$

Goal: find the hyperplane that best divides the positive samples from the negative samples


What do we want to do here?
Take an average


We look for the line that passes most centrally between the two classes.


Reminder: the dot product of 2 collinear vectors satisfies $\langle w, \overrightarrow{AB} \rangle = \|w\| \, \|\overrightarrow{AB}\|$ (when they point in the same direction)
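
A sketch with scikit-learn's `SVC` (toy blobs stand in for real data; sklearn accepts {0, 1} labels and handles the ±1 convention internally):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated toy blobs.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

svm = SVC(kernel="linear").fit(X, y)
print(svm.coef_, svm.intercept_)   # the separating hyperplane w, b
print(len(svm.support_vectors_))   # only a few points define the margin
```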

Soft margin SVM

Data may not be fully linearly separable → allow some samples to violate the margin (slack variables), with a cost parameter $C$ controlling the trade-off

Kernel SVM

Remember the kernel trick?

Kernel trick:

  • map data points into a high-dimensional space where they would become linearly separable
  • effortlessly interfaced with the SVM by replacing the dot product $\langle \cdot, \cdot \rangle$ by its kernelized version $k(\cdot, \cdot)$

Widely used kernel functions:

  • Polynomial kernel
  • Gaussian RBF kernel
  • Sigmoid kernel

Choosing the right kernel with the right hyperparameters

Kernel

Try linear first. If it does not work, RBF is probably the best kernel choice (unless you have some prior information on the geometry of your dataset)

Hyperparameters ($C$ + kernel parameter(s)) → grid search and cross-validation
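
A sketch of such a grid search (the log-spaced grids below are common starting points, not course-prescribed values):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Log-spaced grids for C and the RBF kernel parameter gamma.
param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```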

Multiclass SVM

What if we have more than 2 classes?

2 possible strategies

One versus all: one SVM model per class → separate the class from all other classes

  • Assign new points with the winner-takes-all rule
  • If there is no outright winner, assign the point to the class of the closest hyperplane (Platt scaling)

One versus one: one SVM model per pair of classes → separate 2 classes at a time, ignoring the other data

  • Assign new points with a majority voting rule
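
A sketch of both strategies using scikit-learn's meta-estimators (iris, with 3 classes, is a placeholder dataset; with 3 classes both strategies happen to train 3 models):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # one model per class
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # one model per pair
print(len(ova.estimators_), len(ovo.estimators_))
```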

Decision trees

Decision trees use recursive partitioning to create a sequence of decision rules on the input features that yields nested splits of the data points

Input features can be numeric (decision $\leq$) or categorical (decision $==$)

Decision node = decision rule for one feature

Classification tree → predict a class
Regression tree → predict a real number

At the current node, try all the possible decision rules for all features and select the one that best splits the data:

Classification tree → impurity criterion
Regression tree → variance reduction

Final decision boundaries → overlapping orthogonal half-planes
Decision on new data → run it down through the branches and assign the class

How to split a node

Which split should we choose? The answer is the left one.

Stop recursive partitioning if the node is pure

Pros and cons of decision trees

Pros

  • Simple decision rules
  • Surprisingly computationally efficient
  • Handle multiclass problems
  • Handle numeric and categorical features at the same time

Cons

  • Strongly overfit data
    • Bad predictive accuracy

Potential solution → restrict the growth of the tree by imposing a maximal tree depth
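
A sketch contrasting an unconstrained tree with a depth-capped one (iris is a placeholder dataset, `max_depth=3` an arbitrary cap; both cross-validated scores are printed for comparison):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained tree: grows until leaves are pure (tends to overfit).
full = DecisionTreeClassifier(random_state=0)
# Capped depth restricts growth and often generalizes better.
capped = DecisionTreeClassifier(max_depth=3, random_state=0)

print(cross_val_score(full, X, y, cv=5).mean())
print(cross_val_score(capped, X, y, cv=5).mean())
```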

Random forests

Bagging several decision trees

Decision trees are weak classifiers when considered individually

  • Average the decisions of several of them
    • Compensates their respective errors (wisdom of crowds)
  • Useless if all decision trees see the same data
    • Introduce some variability with bagging (bootstrap aggregating)
  • Introduce more variability by selecting only $p$ out of the $m$ total features for each split in each decision tree (typically $p = \sqrt{m}$; see the sketch below)

Final decision is taken by majority voting on all decision tree outputs
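
A minimal sketch; `max_features="sqrt"` implements the $p = \sqrt{m}$ feature subsampling mentioned above (toy data assumed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=16, random_state=0)

# 100 trees, each trained on a bootstrap sample; max_features="sqrt"
# draws sqrt(m) candidate features at every split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.predict(X[:5]))  # majority vote over the 100 trees
```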

Decision boundaries comparison

Evaluating regression/classification performance

Cross-validation

k-fold cross-validation

  • Divide the whole dataset into $k$ non-overlapping sample blocks
  • Train $k$ models, each on $(k-1)$ training blocks, and test on the remaining block
  • Compute performance metrics for each model + average & standard deviation over all $k$ models (see the sketch below)
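
A minimal sketch with scikit-learn (model and data are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5 models, each trained on 4 blocks and tested on the remaining one.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```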

Confusion matrix
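
A confusion matrix counts, for each true class, how many samples were predicted as each class. A minimal sketch (toy data, logistic regression as a placeholder classifier):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
# Rows = true classes, columns = predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)
```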