# MLRF: Lecture 05

# Agenda
1. Introduction
2. Image classification overview
3. Some classifiers - part 1
4. Classifier evaluation

# Summary of last lecture
Content-based image retrieval
- 2 strategies: keep all local descriptors for all images vs **1 descriptor per image**
- Bag of Visual Words pipeline
- Focus on encoding

Evaluation of image retrieval systems
- Precision
- Recall
- F-Measure
- mAP

Texture descriptors (not covered at all)
- What is a texture?
- Fast and classic approaches
- Old-school descriptors

# Practice session 4: Take home messages
BoVW
- Usually requires some **preprocessing** of the descriptors: centering, rotation/axes permutation, dimensionality reduction...
- Is based on a **quantization step** (assign descriptors to clusters)
- Is **just a histogram**, like the color histogram of session 2
- We can compute **more advanced statistics** to get better results (VLAD, FVs)

:::warning
**Best practices:**
- Test array shapes and types as soon as possible
- Make a small change, test, fix, test, validate, repeat
- Get a complete, basic pipeline ASAP and improve it until time is over
:::

# Next practice session
Implement a simple **image classifier**:

![](https://i.imgur.com/JRTgDIl.png)

:::danger
**Will be graded**
:::

## Steps
1. Load resources
2. Train a BoVW model
3. Split the dataset into training and validation sets
4. Compute the BoVW descriptor for each image
   - We will make a small change here (sqrt + L2-norm)
5. Prepare training structures
6. Train a classifier and evaluate its performance
   - Training and evaluating is easy with scikit-learn
7. Display some results
8. Test on the meme image
9. Compute the results on the test set and export them

# Image classification overview
## Instance recognition vs Class recognition
### Instance recognition
Re-recognize a known 2D or 3D rigid object, potentially being viewed from a novel viewpoint, against a cluttered background, and with partial occlusions

*Ex: practice session 3*

![](https://i.imgur.com/zjxx8F3.png)

### Class recognition
Recognize any instance of a particular general class such as "cat", "car" or "bicycle"

Aka *category-level* or *generic* object recognition

:::warning
More challenging
:::

*This lecture and next practice session*

![](https://i.imgur.com/fvvBXlU.png)

## Pipeline overview
![](https://i.imgur.com/ufF8vdn.png)

## Our image classification pipeline
This is a **supervised** machine learning task
- We need a dataset with samples
- Images will be represented as BoVW vectors of fixed size
  ![](https://i.imgur.com/iGnNMb9.png)
- Targets will be encoded as integers
  ![](https://i.imgur.com/OV5hiUn.png)
  - 0: Muffin
  - 1: Chihuahua

This is a very common data representation for a classification problem.

:::info
Classifier inputs = "samples" with "features"
Classifier outputs = "labels"
![](https://i.imgur.com/RUZZam9.png)
:::

Now we just need to select an appropriate method, prepare our data, run some training, test the results, adjust some parameters, compare approaches, display results, ...
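To make this representation concrete, here is a minimal NumPy sketch; the sizes (100 images, 64 visual words) are made up for illustration and are not the practice-session data:

```python
import numpy as np

# Hypothetical sizes: 100 images, each described by a 64-bin BoVW histogram.
n_images, n_words = 100, 64

# Samples: one row per image, one column per feature (visual word frequency).
X = np.random.rand(n_images, n_words).astype(np.float32)  # shape (100, 64)

# Targets: one integer label per image (0 = muffin, 1 = chihuahua).
y = np.random.randint(0, 2, size=n_images)                # shape (100,)

print(X.shape, X.dtype)  # (100, 64) float32
print(y.shape, y.dtype)  # (100,) and a platform-dependent integer dtype
```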
# Data preparation
## NumPy formatting
![](https://i.imgur.com/WnT18Ch.png)

## Training/validation/test separation
- **You cannot estimate the generalization performance of your predictor/estimator/classifier on its training set**
- You need to keep some samples aside for later evaluation

## Other "funny" things to do IRL
- Collect data
- Clean data
- Check data
- Clean again
- Annotate
- Check
- Compute/convert/scale features

## Feature selection
:::info
Consists of dropping some data columns
:::

Can help later stages:
- Less data to process
- Better properties (like decorrelated features, etc.)

Which columns?
- Hard problem in general
  - Because features may be informative **as a group**
- Some simpler and helpful techniques:
  - Remove features with low variance
  - Dimensionality reduction techniques are not exactly feature selection, but can still have a similar effect

# Some classifiers - part 1
## Disclaimer
:::warning
What follows is a very limited selection
:::

Only classifiers suitable for image classification as we present it today

:::info
**input = feature vector**
**output = label**
:::

Many other approaches exist

## What is our goal?
:::danger
Given samples (described by features) and true labels, find a good function which will correctly predict labels given new data samples
:::

## Parametric vs Non-Parametric Classifiers
### Parametric examples
Logistic regression, Linear Discriminant Analysis, naive Bayes, Perceptron, simple neural networks...

> A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won't change its mind about how many parameters it needs.

### Non-parametric examples
k-Nearest Neighbors, Decision Trees, SVMs

> *"Non-parametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from data. The term non-parametric is not meant to imply that such models completely lack parameters but that the number and nature of the parameters are flexible and not fixed in advance."*
> Wikipedia

> *"Nonparametric methods are good when you have a lot of data and no prior knowledge."*

## Dummy classifiers
Say you have a dataset with 9 muffins and 1 chihuahua. You have a new sample to classify. **Which class should you bet on?**

![](https://i.imgur.com/Sy1wvkC.png)

If your class prior probabilities $P(C_1), P(C_2),\dots$ are not equal, then you should bet on the **most frequent class!** ($g(x)=argmax_yp(y)$)

Without such information, you can just pick at random.

*What is the expected accuracy (true predictions / total predictions) if you have N classes and pick one at random?*

:::info
Scikit-learn offers a DummyClassifier class which makes it easy to test such a strategy
:::

### What's the point?
1. Quickly build and test your complete pipeline with a mockup classifier
2. Quickly get a baseline for the performance
3. (Look for obvious bias in the dataset, but you should have cleaned it before!)

## K Nearest Neighbors (kNN)
Keep all training samples.

View new samples as queries over the previously learned / indexed samples.

![](https://i.imgur.com/vGq3eFP.png)

Assign the class of the closest sample(s):
$$
f(x) = y_i, \quad i = argmin_j\Vert x_j-x\Vert
$$

![](https://i.imgur.com/13qgkgU.png)

We can check more than one sample.

![](https://i.imgur.com/yQB482y.png)
![](https://i.imgur.com/NYzoX1K.png)
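As a hedged illustration of both the DummyClassifier baseline and kNN described above, here is a minimal scikit-learn sketch on random stand-in data; the shapes and random labels are assumptions for illustration, not the practice-session code:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Random stand-in for (BoVW vectors, integer labels); shapes are made up.
rng = np.random.default_rng(0)
X = rng.random((200, 64))
y = rng.integers(0, 2, size=200)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Baseline: always predict the most frequent class of the training set.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("dummy accuracy:", dummy.score(X_val, y_val))

# kNN: predict by majority vote among the k closest training samples.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("kNN accuracy:", knn.score(X_val, y_val))
```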
### Remember the bias/variance trade-off?
![](https://i.imgur.com/LxXw7gB.png)

### Pros
- Very simple to implement
- Capacity easily controlled with k
- Can be tuned to work on large datasets: indexing, data cleaning, etc.
- Good baseline
- Non-parametric
- Lazy learner

### Cons
- In high dimension, all samples tend to be very close (for the Euclidean distance)
- Large memory consumption on large datasets
- Requires a large number of samples and a large k to get the best performance

:::danger
**Setting K:**
$$
K\simeq\sqrt{\frac{m}{C}}
$$
$\frac{m}{C}$: average number of training samples per class
:::

## Other distance-based classifiers
### Minimal Euclidean distance
**Very basic classifier**

Distance to the mean $m_i$ of the class.

It does not take into account differences in variance for each class.

Predicted class for $x$:
$$
g(x) = argmin_iD_i(x)
$$

![](https://i.imgur.com/zTOkAeB.png)

### Minimal quadratic distance (Mahalanobis)
For each class $i$, the mean $m_i$ and covariance matrix $S_i$ are computed from the set of examples.

The covariance matrix is taken into account when computing the distance from an image to the class $i$: the feature vector of the image $x$ is projected over the eigenvectors of the class.
$$
g(x) = argmin_iD_i(x)
$$

![](https://i.imgur.com/iBAfkDT.png)

# A quick introduction to Bayesian Decision Theory
## Example - RoboCup
![](https://i.imgur.com/6znaC7D.png)
![](https://i.imgur.com/gS6isnG.png)
![](https://i.imgur.com/nkITm9Y.png)

## General case: maximum a posteriori (MAP)
:::info
General case: need to take into consideration $p(y)$ and $p(x)$
:::

- $p(x\vert y)$: class conditional density (here: histograms)
- $p(y)$: class priors, e.g. for indoor RoboCup
  - $p(floor) = 0.6$, $p(goal) = 0.3$, $p(ball) = 0.1$
- $p(x)$: probability of seeing data $x$

Optimal decision rule (Bayes classifier): maximum a posteriori (MAP):
$$
g(x) = argmax_{y\in Y}p(y\vert x)
$$

### How to compute $p(y\vert x)$?
$$
p(y\vert x) = \frac{p(x\vert y)p(y)}{p(x)}\quad\text{Bayes' rule}
$$

If classes are equiprobable and the error cost is the same, then, because $p(x)$ is constant, we get the maximum likelihood estimation:
$$
g(x) = \underbrace{argmax_{y\in Y}p(y\vert x)}_{\text{MAP}}\simeq\underbrace{argmax_{y\in Y}p(x\vert y)}_{\text{ML}}
$$

## Generative, discriminant and "direct" classifiers
![](https://i.imgur.com/UlPjAcK.png)

# Generative Probabilistic Models
## Some classical Generative Probabilistic Models
Training data $X=\{x_1,\dots,x_n\}$, $Y=\{y_1,\dots,y_n\}$, $X\times Y\subset\mathcal X\times\mathcal Y$

For each $y\in\mathcal Y$, build a model for $p(x\vert y)$ from $X_y:=\{x_i\in X:y_i=y\}$
- Histogram: if $x$ can have only a few discrete values
- Kernel Density Estimator
  ![](https://i.imgur.com/ZIJ8gxo.png)
- Gaussian
  ![](https://i.imgur.com/KFvVTpi.png)
- Mixture of Gaussians
  ![](https://i.imgur.com/KDKezfM.png)

Typically, $\mathcal Y$ is small (few possible labels) and $\mathcal X$ is low dimensional.

## Class conditional densities and posteriors
![](https://i.imgur.com/DUCdiKg.png)

## Naive Bayes Classifiers
![](https://i.imgur.com/E5Pz6Qx.png)
![](https://i.imgur.com/VxjSGbd.png)
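For a concrete, hedged example of a generative classifier, the sketch below uses scikit-learn's GaussianNB on made-up 2D toy data: each class-conditional density $p(x\vert y)$ is modeled with per-feature Gaussians, and prediction follows the MAP rule above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy 2-class, 2-feature data (made up): one Gaussian blob per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),   # class 0
               rng.normal(3.0, 1.0, (50, 2))])  # class 1
y = np.array([0] * 50 + [1] * 50)

# Fits per-class, per-feature Gaussians for p(x|y) plus class priors p(y).
clf = GaussianNB().fit(X, y)

# Prediction applies the MAP decision rule: argmax_y p(x|y) p(y).
print(clf.predict([[0.2, -0.1], [2.8, 3.1]]))  # expected: [0 1]
print(clf.predict_proba([[1.5, 1.5]]))         # posteriors p(y|x)
```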
# Linear discriminant classifiers
## General idea for binary classification
![](https://i.imgur.com/l6uLtyC.png)
![](https://i.imgur.com/H4NM4Za.png)

Learn $w$ and $b$ - you can compute $p(y\vert x)\simeq\hat y$

:::warning
Problem: how to learn $w$ and $b$?
:::

## Logistic Regression
Linear classifier where $f$ is the logistic function:
$$
\sigma(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{1+e^x}
$$
- Maps all reals to $[0,1]$

Optimize $\sigma(w^Tx+b)$ to find the best $w$.

Trained using gradient descent (no closed-form solution).

![](https://i.imgur.com/RZoBc0Y.png)

## Gradient descent
Formally:
$$
w_{t+1}=w_t-\eta\nabla L(w)
$$
where $\eta$ is the *step size*, i.e. how far to step relative to the gradient.

![](https://i.imgur.com/eXfRl8J.png)

## From 2 classes to C classes: 2 strategies
![](https://i.imgur.com/hB0yq8C.png)
![](https://i.imgur.com/E8QRm0Z.png)
$$
\hat y = argmax_{i\in Y}w_i^Tx
$$

## Maximum Margin classification
![](https://i.imgur.com/8zbmVyB.png)

What is the best $w$ for this dataset?

Trade-off: large margin vs few mistakes on the training set

![](https://i.imgur.com/3tH5H3Z.png)

## Support Vector Machine (SVM)
![](https://i.imgur.com/AMNYWuJ.png)

## Logistic Regression vs SVM
Optimization problems:
![](https://i.imgur.com/NGEFBBe.png)

## About the regularizer
![](https://i.imgur.com/zY3MiZQ.png)

## Effect of cost parameter C (regularization, again)
![](https://i.imgur.com/QEjgzaK.png)

# Non-linear discriminant classifiers
## Non-linear classification
What is the best linear classifier for this dataset?

![](https://i.imgur.com/4IRGUzO.png)

:::success
None. We need something nonlinear!
:::

2 solutions:
1. Preprocess the data (explicit embedding, kernel trick…)
2. Combine multiple linear classifiers into a nonlinear classifier (boosting, neural networks…)

# Non-linear classification using linear classifiers with data preprocessing
## Data preprocessing idea
Transform the dataset to enable linear separability

![](https://i.imgur.com/iJr7bVs.png)

## Linear separation is always possible
The original input space can always be mapped to some higher-dimensional feature space where the training set is separable.

![](https://i.imgur.com/21JoSx4.png)

## Explicit embedding
Compute $\phi(x)$ for all $x$ in the dataset, then train a linear classifier just like before.

:::warning
Used to be avoided because of computation issues, but it is a hot topic again.
:::

## Kernel trick
Linear classification only requires computing dot products $\phi(x_i)\cdot\phi(x_j)$.

**The function $\phi(x)$ does not need to be explicit:** we can use a kernel function
$$
k(x,z)=\phi(x)\cdot\phi(z)
$$
which represents a dot product in a "hidden" feature space.

:::success
This gives a non-linear boundary in the original feature space.
:::

## Popular kernel functions in Computer Vision
"Linear kernel": identical solution to a linear SVM
![](https://i.imgur.com/dtD4CbJ.png)

"Hellinger kernel": less sensitive to extreme values in the feature vector
![](https://i.imgur.com/Cz0oSkY.png)

"Histogram intersection kernel": very robust
![](https://i.imgur.com/CDbNlrq.png)

"$\chi^2$-distance kernel": good empirical results
![](https://i.imgur.com/dG3QTIm.png)

"Gaussian kernel": overall most popular kernel in Machine Learning
![](https://i.imgur.com/f0yjvx6.png)

## Explicit embedding for the Hellinger kernel
![](https://i.imgur.com/bakMDIs.png)

Using simple square root properties, we have:
$$
k(x,x') = \phi(x)\cdot\phi(x') = \sum_i \sqrt{x_i\,x'_i} \quad\text{with}\quad \phi(x)=\sqrt{x}\ \text{(element-wise)}
$$

Tricks for the next practice session: given a BoVW vector (see the sketch after this list),
1. L1-normalize it (neutralizes the effect of the number of descriptors)
2. Take its square root (explicit Hellinger embedding)
3. L2-normalize it (more linear-classifier friendly)
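A minimal NumPy sketch of this trick; the function name and the example histogram are made up for illustration:

```python
import numpy as np

def hellinger_embed(bovw):
    """Map a BoVW histogram so that a linear classifier on the result
    behaves like a Hellinger-kernel classifier on the original counts."""
    v = np.asarray(bovw, dtype=np.float64)
    v = v / max(v.sum(), 1e-12)             # 1. L1 normalization
    v = np.sqrt(v)                          # 2. element-wise square root
    v = v / max(np.linalg.norm(v), 1e-12)   # 3. L2 normalization
    return v

# Hypothetical 8-bin histogram of visual word counts.
print(hellinger_embed([4, 0, 1, 3, 0, 0, 2, 0]))
```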
# Metrics
## Confusion matrix and Accuracy
![](https://i.imgur.com/OZXWA8Q.png)

## Problems with Accuracy
All the following classifiers have a 90% accuracy

![](https://i.imgur.com/2ciC5Uz.png)

*Do all errors have the same cost?*

## Precision, recall, F-score
![](https://i.imgur.com/VMpNMQH.png)

## Plotting a Precision/Recall curve for classification data
For binary classification: instead of $\hat y = argmax_yp(y\vert x)$, take **all possible thresholds** for $p(y\vert x)$

![](https://i.imgur.com/VtynSfU.png)

## TPR, FPR, ROC
ROC: "Receiver Operating Characteristic"

Kind of signal/noise measure under various tunings

![](https://i.imgur.com/k3BeCv0.png)

Pink line: random results

## More about ROC curves
### Adjusting the threshold
[http://www.navan.name/roc/](http://www.navan.name/roc/)

![](https://i.imgur.com/2CQamZr.png)

### Class overlap
![](https://i.imgur.com/BWAThY5.png)

# Split the dataset to assess generalization performance
## Bootstrap
Draw samples randomly, with replacement, from the training set.

Enables us to estimate the variance of the estimators we use in the classification rule.

![](https://i.imgur.com/ZuxQOyR.png)

## Holdout
Just keep a part of the dataset for later validation/testing.

![](https://i.imgur.com/513iOG6.png)

## Cross validation
![](https://i.imgur.com/4KkbALB.png)

### With meta-parameter tuning
![](https://i.imgur.com/vf2xEDm.png)

## StratifiedKFold (best)
![](https://i.imgur.com/IELUcaX.png)

A minimal usage sketch is given after the list below.

## Missing things
- Cost of misclassification
- Multiclass classification evaluation
- ...
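To tie the splitting strategies back to scikit-learn, here is a hedged sketch of stratified k-fold cross-validation; the data and the choice of classifier are placeholders for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Random stand-in for (BoVW vectors, integer labels).
rng = np.random.default_rng(0)
X = rng.random((100, 64))
y = rng.integers(0, 2, size=100)

# Stratified 5-fold CV: each fold preserves the class proportions of y.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("accuracy per fold:", scores, "mean:", scores.mean())
```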