MLRF: Lecture 05
Agenda
- Introduction
- Image classification overview
- Some classifiers - part 1
- Classifier evaluation
Summary of last lecture
Content-based image retrieval
- 2 strategies: keep all local descriptors for all images vs 1 descriptor per image
- Bag of Visual Words pipeline
Evaluation of image retrieval systems
- Precision
- Recall
- F-Measure
- mAP
Texture descriptors (we did not cover them at all)
- What is a texture ?
- Fast and classic approaches
- Old-school descriptors
Practice session 4: Take home messages
BoVW
- Usually requires some preprocessing of the descriptors: centering, rotation/axes permutation, dimensionality reduction…
- Is based on a quantization step (assign descriptors to clusters)
- Is just a histogram, like the color histogram of session 2
- We can compute more advanced statistics to get better results (VLAD, FVs)
Best practices:
- Test arrays shapes and types as soon as possible
- Make a small change, test, fix, test, validate, repeat
- Get a complete, basic pipeline ASAP and improve it until time is over
Next practice session
Implement a simple image classifier:
Steps
- Load resources
- Train a BoVW model
- Split the dataset into training and validation sets
- Compute the BoVW descriptor for each image
- We will make a small change here (sqrt + L2-norm)
- Prepare training structures
- Train a classifier and evaluate its performance
- Training and evaluating are easy with scikit-learn
- Display some results
- Test on meme image
- Compute the results on the test set and export them
Image classification overview
Instance recognition vs Class recognition
Instance recognition
Re-recognize a known 2D or 3D rigid object, potentially being viewed from a novel viewpoint, against a cluttered background, and with partial occlusions
Ex: practice session 3
Class recognition
Recognize any instance of a particular general class such as "cat", "car" or "bicycle"
Aka category-level or generic object recognition
This lecture and next practice session
Pipeline overview
Our image classification pipeline
This is a supervised machine learning task
- We need a dataset with samples
- Images will be represented as BoVW vectors of fixed size
- Targets will be encoded as integers
This is a very usual data representation for a classification problem
Classifier inputs = "samples" with "features"
Classifier outputs = "labels"
Now we just need to select an appropriate method, prepare our data, run some training, test the results, adjust some parameters, compare approaches, display results, …
Data preparation
Training/validation/test separation
- You cannot estimate the generalization performance of your predictor/estimator/classifier on its training set
- You need to keep some samples aside for later evaluation
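A minimal sketch of such a separation, calling scikit-learn's `train_test_split` twice. The random data, the 60/20/20 proportions and the variable names are assumptions chosen for illustration, not values from the lecture.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 samples, 8 features, 4 balanced classes.
rng = np.random.default_rng(0)
X = rng.random((100, 8))
y = np.repeat(np.arange(4), 25)

# First put the test set aside, then split the rest into training/validation.
# stratify keeps the class proportions identical in every subset.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)
```

The test set should only be used once, at the very end; model and parameter choices are made on the validation set.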
Other "funny" things to do IRL
- Collect data
- Clean data
- Check data
- Clean again
- Annotate
- Check
- Compute/convert/scale features
Feature selection
Consists in dropping some data columns
Can help later stages:
- Less data to process
- Better properties (like decorrelated features, etc.)
Which columns ?
- Hard problem in general
- Because features may be informative as a group
- Some simpler and helpful techniques:
- Remove features with low variances
- Dimensionality reduction techniques are not exactly feature selection, but can still have a similar effect
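As an illustration of the low-variance rule, here is a small sketch using scikit-learn's `VarianceThreshold`. The toy matrix and the threshold of 0.1 are arbitrary assumptions; a suitable threshold depends on how the features are scaled.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: column 0 is constant, column 1 varies a lot, column 2 barely varies.
X = np.array([[0.0, 1.0, 5.0],
              [0.0, 2.0, 5.1],
              [0.0, 3.0, 4.9],
              [0.0, 4.0, 5.0]])

# Keep only the columns whose variance is strictly above the threshold.
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)
kept = selector.get_support()  # boolean mask of the kept columns

print(kept, X_reduced.shape)
```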
Some classifiers - part 1
Disclaimer
What follows is a very limited selection
Only classifiers suitable for image classification as we present it today
input = feature vector
output = label
Many other approaches
What is our goal ?
Given samples (described by features) and true labels, find a good function which will correctly predict labels given new data samples
Parametric vs Non Parametric Classifiers
Parametric examples
Logistic regression, Linear Discriminant Analysis, naive Bayes, Perceptron, Simple Neural Networks…
A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won't change its mind about how many parameters it needs.
Non-parametric examples
k-Nearest Neighbors, Decision Trees, SVMs
"Non-parametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from data. The term non-parametric is not meant to imply that such models completely lack parameters but that the number and nature of the parameters are flexible and not fixed in advance"
Wikipedia
"Nonparametric methods are good when you have a lot of data and no prior knowledge"
Dummy classifiers
Say you have a dataset with 9 muffins and 1 chihuahua.
You have a new sample to classify.
Which class should you bet on ?

If your class prior probabilities are not equal, then you should bet on the most frequent class!
Without such information, you can just pick at random
What is the expected accuracy (true predictions / total predictions) if you have N classes and pick one at random ?
Scikit-learn offers a DummyClassifier class which helps testing such a strategy
What's the point ?
- Quickly build and test your complete pipeline with a mockup classifier
- Quickly get a baseline for the performance
- (look for obvious bias in the dataset, but you should have cleaned it before !)
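A short sketch of the "bet on the most frequent class" baseline with scikit-learn's `DummyClassifier`, on the 9 muffins / 1 chihuahua dataset described above. The label encoding (0 for muffin, 1 for chihuahua) and the placeholder features are assumptions; a dummy classifier ignores the features anyway.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# 9 muffins (label 0) and 1 chihuahua (label 1); features are placeholders.
X = np.zeros((10, 1))
y = np.array([0] * 9 + [1])

# "most_frequent" always predicts the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
predictions = baseline.predict(X)
baseline_accuracy = baseline.score(X, y)

print(predictions, baseline_accuracy)
```

Any real classifier plugged into the same pipeline should beat this number; if it does not, something is probably wrong upstream.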
K Nearest Neighbor (kNN)
Keep all training samples
View new samples as queries over the previously learned / indexed samples

Assign the class of the closest sample(s)

We can check more than one sample


Remember the bias/variance compromise ?

Pros
- very simple to implement
- Capacity easily controlled with k
- Can be tuned to work on large datasets: indexing, data cleaning, etc.
- Good baseline
- Non parametric
- Lazy learner
Cons
- In high dimension, all samples tend to be very close (for Euclidean distance)
- Large memory consumption on large datasets
- Requires a large amount of samples and large k to get best performance
Setting K:
: average number of training sample/class
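A minimal kNN sketch with scikit-learn's `KNeighborsClassifier`. The two synthetic blobs and the choice of k = 3 are assumptions made for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training set: two well separated 2D blobs.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
                     rng.normal(5.0, 0.5, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)

# "Training" only stores (and possibly indexes) the samples: lazy learner.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Each query gets the majority label among its 3 closest training samples.
queries = np.array([[0.0, 0.0], [5.0, 5.0]])
predicted = knn.predict(queries)
print(predicted)
```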
Other distance-based classifiers
Minimal Euclidean distance
Very basic classifier
Distance to the mean of the class
It does not take into account differences in variance for each class
Predicted class for x: $\hat{y}(x) = \arg\min_{y} \lVert x - \mu_y \rVert_2$, where $\mu_y$ is the mean of class $y$

Minimal quadratic distance (Mahalanobis)
For each class $y$, the mean $\mu_y$ and covariance matrix $\Sigma_y$ are computed from the set of examples
The covariance matrix is taken into account when computing the distance from an image to the class
The feature vector of the image is projected over the eigenvectors of the class: $d_y(x)^2 = (x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y)$
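The two distance-based rules can be compared with a small NumPy sketch. The helper names `fit_class_stats` and `predict`, as well as the toy data, are hypothetical. The data is built so that one class is tight and the other is spread out, which makes the two rules disagree.

```python
import numpy as np

def fit_class_stats(X, y):
    """Estimate the mean and inverse covariance matrix of each class."""
    stats = {}
    for label in np.unique(y):
        X_c = X[y == label]
        mean = X_c.mean(axis=0)
        inv_cov = np.linalg.inv(np.cov(X_c, rowvar=False))
        stats[int(label)] = (mean, inv_cov)
    return stats

def predict(x, stats, use_covariance=False):
    """Return the label of the closest class (squared distances)."""
    distances = {}
    for label, (mean, inv_cov) in stats.items():
        diff = x - mean
        if use_covariance:
            distances[label] = diff @ inv_cov @ diff  # Mahalanobis
        else:
            distances[label] = diff @ diff            # Euclidean
    return min(distances, key=distances.get)

# Class 0 is tight around (0, 0); class 1 is spread out around (4, 0).
X = np.array([[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1],
              [6.0, 0.0], [2.0, 0.0], [4.0, 2.0], [4.0, -2.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
stats = fit_class_stats(X, y)

x_new = np.array([1.5, 0.0])
euclidean_label = predict(x_new, stats)
mahalanobis_label = predict(x_new, stats, use_covariance=True)
print(euclidean_label, mahalanobis_label)
```

The query is closer to the mean of class 0, but it lies far outside the spread of that class, so taking the covariance into account assigns it to class 1.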

A quick introduction to Bayesian Decision Theory
Example - RoboCup



General case: maximum a posteriori (MAP)
General case: need to take into consideration $p(y)$ and $p(x)$
- $p(x \mid y)$: class conditional density (here: histograms)
- $p(y)$: class priors, e.g. for indoor RoboCup
- , ,
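- $p(x)$: probability of seeing data $x$
Optimal decision rule (Bayes classifier): maximum a posteriori (MAP): $\hat{y}(x) = \arg\max_{y} p(y \mid x)$
How to compute $p(y \mid x)$ ? With Bayes' rule: $p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$
If classes are equiprobable and the error cost is the same, then, because $p(x)$ is constant, we get the maximum likelihood estimation: $\hat{y}(x) = \arg\max_{y} p(x \mid y)$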
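A small numeric sketch of the difference between the MAP and maximum likelihood decisions. The histograms and priors below are made-up numbers for two anonymous classes, not the values of the RoboCup example.

```python
import numpy as np

# Hypothetical class-conditional histograms p(x | y) over 3 discrete bins.
likelihood = np.array([[0.7, 0.2, 0.1],   # class 0
                       [0.3, 0.3, 0.4]])  # class 1
# Hypothetical class priors p(y): class 0 is much more frequent.
prior = np.array([0.9, 0.1])

x = 2  # index of the observed bin

joint = likelihood[:, x] * prior   # p(x | y) p(y)
posterior = joint / joint.sum()    # divide by p(x) to get p(y | x)

map_class = int(np.argmax(posterior))         # uses the priors
ml_class = int(np.argmax(likelihood[:, x]))   # ignores the priors
print(posterior, map_class, ml_class)
```

Here the observation is more likely under class 1, but the strong prior for class 0 flips the MAP decision.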
Generative, discriminant and "direct" classifiers

Generative Probabilistic Models
Some classical Generative Probabilistic Models
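Training data $X = \{x_1, \dots, x_n\}$, $Y = \{y_1, \dots, y_n\}$.
For each class $y$, build a model for $p(x \mid y)$ of the training samples labeled $y$
- Histogram: if $x$ can have only a few discrete values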
- Kernel Density Estimator

- Gaussian

- Mixture of Gaussians

Typically, the label set is small (few possible labels) and the feature space is low dimensional
Class conditional densities and posteriors

Naive Bayes Classifiers


Linear discriminant classifiers
General idea for binary classification


Learn w and b
Problem: how to learn w and b ?
Logistic Regression
Linear classifier whose output is passed through the logistic function $\sigma(z) = 1 / (1 + e^{-z})$
Optimize the training loss to find the best $w$ and $b$
Trained using gradient descent (no closed form solution)

Gradient descent
Formally: $w_{t+1} = w_t - \eta \nabla_w L(w_t)$, with $L$ the loss to minimize
Where $\eta$ is the step size: how far to step relative to the gradient
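A sketch of logistic regression trained by plain gradient descent in NumPy. The loss is assumed to be the mean negative log-likelihood; the function names, the step size, the number of steps and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, step_size=0.1, n_steps=1000):
    """Minimize the mean negative log-likelihood by gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)             # predicted P(y = 1 | x)
        grad_w = X.T @ (p - y) / n_samples
        grad_b = np.mean(p - y)
        w -= step_size * grad_w            # step against the gradient
        b -= step_size * grad_b
    return w, b

# Hypothetical data: two Gaussian blobs, labels 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)),
               rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b = train_logistic_regression(X, y)
train_accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(w, b, train_accuracy)
```

A step size that is too large makes the iterations diverge, while one that is too small makes convergence slow; in practice scikit-learn's `LogisticRegression` handles the optimization.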

From 2 classes to C classes: 2 strategies


Maximum Margin classification

What is the best $w$ for this dataset ?
Trade-off:
large margin vs few mistakes on training set

Support Vector Machine (SVM)

Logistic Regression vs SVM
Optimization problems:

About the regularizer

Effect of cost parameter C (regularization, again)

Non-linear discriminant classifiers
Non-linear classification
What is the best linear classifier for this dataset?

None. We need something nonlinear!
2 solutions:
- Preprocess the data (explicit embedding, kernel trick…)
- Combine multiple linear classifiers into nonlinear classifier (boosting, neural networks…)
Non-linear classification using linear classifiers with data preprocessing
Data preprocessing idea
Transform the dataset to enable linear separability

Linear separation is always possible
The original input space can always be mapped to some higher-dimensional feature space where the training set is separable.

Explicit embedding
Compute $\varphi(x)$ for all $x$ in the dataset.
Then train a linear classifier just like before
Used to be avoided because of computation issues, but it is a hot topic again.
Kernel trick
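Linear classification only requires computing dot products $\langle \varphi(x_i), \varphi(x_j) \rangle$
The function $\varphi$ does not need to be explicit: we can use a kernel function $k(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$
which represents a dot product in a “hidden” feature space.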
This gives a non-linear boundary in the original feature space.
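To illustrate, the sketch below compares a linear and a Gaussian (RBF) kernel SVM on concentric circles, a dataset that no straight line can separate. The dataset parameters are arbitrary assumptions and the accuracies are measured on the training set, only to show the difference in capacity.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2D space.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)

# Same classifier, two different kernels.
linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

linear_accuracy = linear_svm.score(X, y)  # training accuracy
rbf_accuracy = rbf_svm.score(X, y)        # training accuracy
print(linear_accuracy, rbf_accuracy)
```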
Popular kernel functions in Computer Vision
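“Linear kernel”: identical solution as linear SVM
$k(x, x') = x^\top x' = \sum_i x_i x'_i$
“Hellinger kernel”: less sensitive to extreme values in the feature vector
$k(x, x') = \sum_i \sqrt{x_i x'_i}$
“Histogram intersection kernel”: very robust
$k(x, x') = \sum_i \min(x_i, x'_i)$
“$\chi^2$-distance kernel”: good empirical results
$k(x, x') = \exp\left(-\gamma\, \chi^2(x, x')\right)$, with $\chi^2(x, x') = \sum_i \frac{(x_i - x'_i)^2}{x_i + x'_i}$
“Gaussian kernel”: overall most popular kernel in Machine Learning
$k(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$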
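These kernels can be sketched in NumPy for two non-negative histograms. The formulas follow common conventions (the exact definitions on the original slides may differ, for instance by a constant factor in the $\chi^2$ distance), and `gamma` is a free parameter set to 1 here by assumption.

```python
import numpy as np

def linear_kernel(x, z):
    return float(x @ z)

def hellinger_kernel(x, z):
    return float(np.sum(np.sqrt(x * z)))

def intersection_kernel(x, z):
    return float(np.sum(np.minimum(x, z)))

def chi2_kernel(x, z, gamma=1.0):
    denom = x + z
    mask = denom > 0  # skip bins that are empty in both histograms
    chi2 = np.sum((x[mask] - z[mask]) ** 2 / denom[mask])
    return float(np.exp(-gamma * chi2))

def gaussian_kernel(x, z, gamma=1.0):
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

# Two L1-normalized toy histograms.
x = np.array([0.25, 0.75, 0.0])
z = np.array([0.25, 0.25, 0.5])

for kernel in (linear_kernel, hellinger_kernel, intersection_kernel,
               chi2_kernel, gaussian_kernel):
    print(kernel.__name__, kernel(x, z))
```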

Explicit embedding for the Hellinger kernel

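Using simple square root properties, we have: $k(x, x') = \sum_i \sqrt{x_i x'_i} = \sum_i \sqrt{x_i}\,\sqrt{x'_i} = \langle \sqrt{x}, \sqrt{x'} \rangle$, so the explicit embedding is the element-wise square root $\varphi(x) = \sqrt{x}$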
Tricks for next practice session: given a BoVW vector,
- L1 normalize it (neutralizes effect of number of descriptors)
- Take its square root (explicit Hellinger embedding)
- L2 normalize it (more linear-classifier friendly)
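The three steps above can be sketched as a small NumPy function. The name `normalize_bovw` and the `eps` guard against empty histograms are assumptions added for illustration.

```python
import numpy as np

def normalize_bovw(hist, eps=1e-12):
    """L1-normalize, take the square root, then L2-normalize a BoVW vector."""
    hist = np.asarray(hist, dtype=np.float64)
    hist = hist / max(hist.sum(), eps)             # L1 normalization
    hist = np.sqrt(hist)                           # explicit Hellinger embedding
    return hist / max(np.linalg.norm(hist), eps)   # L2 normalization

# Toy BoVW histogram with 4 visual words and 16 descriptors.
v = normalize_bovw([4, 0, 12, 0])
print(v, np.linalg.norm(v))
```

For a non-negative histogram, the vector already has unit L2 norm after the first two steps (the squared entries sum to the L1 norm, which is 1), so the last step mainly follows the recipe and absorbs rounding errors.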
Metrics
Confusion matrix and Accuracy

Problems with Accuracy
All the following classifiers have a 90% accuracy

Do all errors have the same cost?
Precision, recall, F-score
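A short sketch of these metrics with `sklearn.metrics`, on made-up binary predictions (1 is the positive class). The labels below are illustrative only.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up ground truth and predictions: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes (labels sorted: 0, 1).
cm = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(cm)
print(accuracy, precision, recall, f1)
```

The accuracy looks decent here, yet only half of the positives are found, which is what recall makes visible.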

Plotting a Precision/Recall for classification data
For binary classification
Instead of a single decision threshold, take all possible thresholds for the classifier score

TPR, FPR, ROC
ROC: “Receiver Operating Characteristic”
Kind of signal/noise measure under various tunings

Pink line: random results
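A minimal sketch of how the points of a ROC curve and the area under it can be obtained with scikit-learn. The four scores are made up; in practice they would come from `predict_proba` or `decision_function`.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up binary labels and classifier scores (higher means "more positive").
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per candidate threshold on the score.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under the ROC curve: 0.5 for random scores, 1.0 for a perfect ranking.
auc = roc_auc_score(y_true, scores)
print(fpr, tpr, auc)
```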
More about ROC curves:
Adjusting the threshold
http://www.navan.name/roc/

Class overlap

Bootstrap
Draw samples randomly, with replacement, from the training set.
Enables us to estimate the variance of estimators we use in the classification rule.
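A small NumPy sketch of the bootstrap, here used to estimate the standard deviation of a sample mean. The synthetic data, the choice of estimator and the number of resamples are assumptions for illustration; the same loop applies to any estimator, for instance a classifier's accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)  # hypothetical sample

n_boot = 1000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Draw len(data) indices with replacement, then recompute the estimator.
    idx = rng.integers(0, len(data), size=len(data))
    boot_means[i] = data[idx].mean()

# Spread of the estimator across the bootstrap resamples.
boot_std = boot_means.std(ddof=1)
print(boot_std)
```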

Holdout
Just keep a part of the dataset for later validation/testing

Cross validation


StratifiedKFold (best)
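A sketch of stratified cross validation with scikit-learn on an imbalanced toy dataset. The data, the kNN classifier and the 5 folds are illustrative assumptions; the point is that each fold keeps the class proportions of the full dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical imbalanced dataset: 45 samples of class 0, 5 of class 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (45, 2)),
               rng.normal(5.0, 0.5, (5, 2))])
y = np.array([0] * 45 + [1] * 5)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count the positives in each validation fold: stratification spreads them.
fold_positives = [int(y[val_idx].sum()) for _, val_idx in cv.split(X, y)]

# One accuracy score per fold.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=cv)
print(fold_positives, scores.mean())
```

With a plain `KFold`, some folds could contain no sample of the rare class at all, which makes the per-fold scores hard to compare.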

Missing things
- Cost of misclassification
- Multiclass classification evaluation
- …