# MLRF: Lecture 04

# Agenda for lecture 4
1. Introduction
2. Content-based image retrieval (CBIR) using bags of features
3. Evaluating CBIR / Ranked Retrieval (RR) systems
4. Texture descriptors
5. Character descriptors

# Summary of last lecture
- Descriptor matching
  - 1-way
  - Cross check
  - Ratio test
  - Radius threshold
- Descriptor indexing
  - Indexing pipeline: train/query
  - Linear matching
  - kD-Trees
  - FLANN/hierarchical k-Means
  - LSH
  - Approximate NN problem
- Projective transformations
  - Translation
  - Rotation
  - Scaling
  - ...
  - Projective
- Homography estimation
  - Least squares
  - RANSAC
  - Geometric validation

# Practice session 3: Take-home messages
*Twin it!*: Extracting descriptors, matching them **by hand**
- Detect keypoints and extract the surrounding pixels into a flat vector
- Normalize them and compare them using cross-correlation ($\sum_i f_i \cdot g_i$)

Augmented Documents: Use an off-the-shelf detector/descriptor: ORB

Augmented Documents: Projective transforms and homography estimation
- OpenCV provides the solver machinery: list of matches $\to$ $3\times 3$ matrix
- Just a coordinate transform (2D $\to$ 2D)
- Remember the classical matrix forms: **translation, rotation, ...**

# Next practice session
:::success
Implement a simple **image search engine**
:::

![](https://i.imgur.com/uwclkII.png)

:::danger
Will be graded
:::

# Local feature descriptors
## Introduction
*Given some keypoints in image 1, what are the most similar ones in image 2?*

![](https://i.imgur.com/vDMNGYF.png)

:::info
This is a **nearest neighbor problem** in **descriptor space**.
This is also a **geometrical problem** in **coordinate space**.
:::

:::success
Local feature **detectors** give us the same feature under several perturbations: perspective, illumination, blur...
:::

Local feature descriptors associate a **vector with each local feature**. Such a description vector should be:
- **Compact**: to enable fast indexing and matching
- **Discriminative**: to enable object recognition
- **Robust to perturbations**: to tolerate real conditions

We will focus on two widely used descriptors, chosen for their pedagogical interest:
- **HOG** (Histogram of Oriented Gradients), used in SIFT
- **BRIEF** (Binary Robust Independent Elementary Features), used in ORB

# Histogram of Oriented Gradients
## Algorithm overview
1. (optional) Global image normalization
2. Compute the gradient image in $x$ and $y$
3. Compute gradient histograms
4. Normalize across blocks
5. Flatten into a feature vector
6. (Quantize to integers)

![](https://i.imgur.com/zACmsUy.png)
![](https://i.imgur.com/xSKmwma.png)

## Example
![](https://i.imgur.com/vQ4mY5v.png)

> That feeling when you bang your elbow right on the nerve

## Summary
### Pros
- Very stable over illumination changes, perspective changes, blur

### Cons
- Slow to compute
- Quite large (128 bytes for the original SIFT descriptor)

# BRIEF
General idea:
1. Sample pairs of points $(\mathbf{x}, \mathbf{y})$ in the **smoothed** keypoint neighborhood (the signal is very spiky otherwise, like derivatives)
2. Compute a simple **binary test**: $p(\mathbf{x}) < p(\mathbf{y})$
3. Accumulate the results of $n_d$ tests to form a **binary vector of $n_d$ bits** (256 in the reference implementation)

![](https://i.imgur.com/1SDVx4U.png)

## Sampling strategies
- GI: $(x_i, y_i) \sim$ i.i.d. Uniform$(-\frac{S}{2}, +\frac{S}{2})$
- GII: $(x_i, y_i) \sim$ i.i.d. Gaussian$(0, \frac{1}{25}S^2)$
- GIII:
  - $x_i \sim$ i.i.d. Gaussian$(0, \frac{1}{25}S^2)$
  - $y_i \sim$ i.i.d. Gaussian$(x_i, \frac{1}{100}S^2)$
- GIV: $(x_i, y_i)$ randomly sampled from discrete locations of a coarse polar grid, introducing a spatial quantization
- GV:
  - $x_i = (0, 0)$
  - $y_i$ takes all possible values on a coarse polar grid containing $n_d$ points

![](https://i.imgur.com/AXKYqZM.png)

**What is the best approach?**
:::success
Strategy GII (isotropic Gaussian sampling) is the best
:::

![](https://i.imgur.com/llHURZA.png)
![](https://i.imgur.com/MBS2vD7.png)
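To make the sample-test-accumulate idea concrete, here is a minimal NumPy sketch of a BRIEF-style descriptor. This is not the reference implementation: the patch size, the GII-style sampling parameters and the function name are illustrative choices.

```python
import numpy as np

def brief_descriptor(patch, pairs):
    """Accumulate n_d binary tests p(x) < p(y) over a pre-smoothed patch."""
    bits = np.empty(len(pairs), dtype=np.uint8)
    for i, (x1, y1, x2, y2) in enumerate(pairs):
        bits[i] = patch[y1, x1] < patch[y2, x2]  # one binary test per pair
    return np.packbits(bits)                     # n_d bits -> n_d / 8 bytes

# GII-style sampling: i.i.d. Gaussian(0, S^2/25), clipped to the S x S patch
rng = np.random.default_rng(42)
S, n_d = 32, 256
pairs = rng.normal(0.0, S / 5.0, size=(n_d, 4))
pairs = np.clip(pairs, -S // 2, S // 2 - 1).astype(int) + S // 2

patch = rng.random((S, S))  # stand-in for a Gaussian-smoothed keypoint patch
desc1 = brief_descriptor(patch, pairs)
desc2 = brief_descriptor(patch.T, pairs)
# Binary descriptors are compared with the Hamming distance:
hamming = np.count_nonzero(np.unpackbits(desc1) != np.unpackbits(desc2))
```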
## Summary
### Pros
- Very fast to compute
- Very fast to match
- Very compact to store

### Cons
- Less robust than HoG (SIFT) or DoH (SURF) in several real cases

# Invariance check
## Rotation invariance
- Add an angle measure
- Take the main gradient orientation
- (Take the average around the keypoint)

We now have for each keypoint:
- Coordinates
- **Orientation**

Descriptor: computed over a **normalized patch**

![](https://i.imgur.com/zBuVgaq.png)

## Scale invariance
Multi-scale feature detection and computation:
- Add a keypoint for each relevant scale
- Possibly several keypoints at the same position

We now have for each keypoint:
- Coordinates
- Orientation
- **Scale**

Descriptor: computed on a **scaled patch**

![](https://i.imgur.com/yMyMl4u.png)

### Reminder: Gaussian sigma vs window size
![](https://i.imgur.com/tXEAYNh.png)

## Illumination invariance
SIFT approach:
1. Normalize the vector
   - Solves affine changes, but what about non-linear sources like camera saturation?
   - ![](https://i.imgur.com/SIti7h8.png)
2. Cap the vector elements to $20\%$ (!) and renormalize
   - Now we have some illumination invariance
   - ![](https://i.imgur.com/NJb8ayu.png)

## Viewpoint invariance
Better, but more complex, approaches can tolerate extreme viewpoint changes

![](https://i.imgur.com/AM7wRyG.png)

# Complete pipelines
## SIFT (Scale-Invariant Feature Transform)
1. Construct the scale space
2. Take differences of Gaussians
3. Locate DoG extrema
4. Locate potential feature points with sub-pixel accuracy
5. Filter edge and low-contrast responses
6. Assign keypoint orientations
7. Build keypoint descriptors
8. Matching, etc.

![](https://i.imgur.com/ULWDf5s.png)

## ORB (Oriented FAST and Rotated BRIEF)
1. Use FAST in pyramids to detect stable keypoints
2. Select the strongest features using the FAST or Harris response
3. Find their orientation using first-order moments
4. Compute the descriptors using BRIEF
   - Where the coordinates of random point pairs are rotated according to the measured orientation
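The whole ORB pipeline is available off-the-shelf in OpenCV, as used in practice session 3. A minimal usage sketch; the file names are placeholders:

```python
import cv2

img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)   # placeholder paths
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

# FAST keypoints in pyramids + orientation + rotated BRIEF, in one call
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors are matched with the Hamming distance (cross-checked)
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
```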
# Conclusion about feature extraction
Selection of appropriate features:
- It is a **critical** decision
- Depends on the **specific application**

Features must:
- Be **invariant** to data variations (depending on the application)
  - rotation
  - perspective
  - noise
  - etc.
- Have **low dimensionality** for fast training, fast matching and reasonable storage

Features determine the **type of information** to work with:
- gray-level, binary, color image
- contours, vectorization, skeleton

Features also determine the **type of classifier / indexer**

# Content-based image retrieval
## Two strategies using local descriptors
### Keep all local descriptors
Pros:
- Enables geometric validation
- Better part detection, in theory

Cons:
- Huge memory requirements

:::success
Like what we did in practice session 3 to match parts of an image (useful to validate geometric constraints and classify an image at the same time)
:::

### Build a global descriptor using local ones
- Inspired by text retrieval
- Compact representation
- Tricks to embed spatial information
- Limited memory requirements

:::success
Like what we did in practice session 2 with the color histogram, at the bubble level
:::

:::info
**Bag of Features** approach
:::

## Pipeline with local descriptors (prev. lecture)
![](https://i.imgur.com/kCXVXCa.png)

## Pipeline with bag of features (current lecture)
![](https://i.imgur.com/nJqjwCc.png)

# Feature extraction
## Sparse vs dense detection
![](https://i.imgur.com/vvZpbRA.png)
![](https://i.imgur.com/9ksurAN.png)

:::info
For dense detection, we usually filter out regions with low variance
:::

![](https://i.imgur.com/ww37BoR.jpg)

## Dimensionality reduction
Often used before encoding to:
- limit dictionary sizes
- facilitate quantization

Several techniques:
- Principal Component Analysis
- Singular Value Decomposition

![](https://i.imgur.com/hk6hNrO.png)

# Encoding
## Bag of Visual Words
- Modern approaches are derived from this one
- Reuses ideas from text/web search for images
- **From a set of descriptors, build a histogram of quantized descriptors**, much like a color histogram

## Quantization
- Discretization of some signal
- **Lossy process!**
- Vectorial formulation: $f: \mathbb{R}^d \to F$, with $F = \{1, 2, \dots, k\}$
- Defines a **Voronoi diagram**, i.e. a decomposition of a metric space determined by the distances to a discrete set of points

![](https://i.imgur.com/MwrAUQh.png)

## Bag of Visual Words (continued)
- Cluster centers are determined using k-Means (once and for all, on a training set)
- Each descriptor is quantized: store the code of the closest centroid
- Build a histogram of descriptor counts, one bin per cluster

:::info
The set of cluster centers is called the dictionary, the codebook, or the visual vocabulary
:::

:::warning
We can choose the number of words!
:::

![](https://i.imgur.com/RGjy5BK.png)
![](https://i.imgur.com/RJUhOE7.png)
![](https://i.imgur.com/f1QlwOv.png)

### Vector size
The resulting vector size for a given image is given by:
$$
D = \text{vocabulary size}
$$

Usually, the bigger the vocabulary, the better the results. Several thousands of words are common.
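Below is a minimal sketch of this dictionary-learning + quantization + counting pipeline. It assumes local descriptors have already been extracted (random arrays stand in for them here), and uses scikit-learn's `KMeans` as one possible clustering backend; the function name and the sizes are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_descriptors = rng.random((10000, 128))  # stand-in for e.g. SIFT vectors

# 1. Learn the visual vocabulary once, on a training set
k = 1000  # vocabulary size, i.e. the BoVW vector dimension D
codebook = KMeans(n_clusters=k, n_init=1, random_state=0).fit(train_descriptors)

def bovw(image_descriptors):
    """Quantize each descriptor to its nearest centroid and count per cluster."""
    words = codebook.predict(image_descriptors)  # lossy quantization step
    return np.bincount(words, minlength=k)       # raw counts; normalization below

vec = bovw(rng.random((350, 128)))  # one image -> one D-dimensional vector
```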
### Normalization
#### First method
Problem:
- The values in the histogram are **absolute**: each bin counts the number of occurrences of a *visual word*
- This makes the descriptor **sensitive to variations in the number of descriptors**

Solution:
- Normalize the histogram

#### Second method
- Like for text retrieval, it is common to **reweight the BoVW vectors** using the **TF-IDF** technique
- Goal: give more importance to rare words than to frequent ones
- For each dimension of the histogram, compute a new value $t_i$

![](https://i.imgur.com/URzKx5n.png)

## Variant: Soft BoVW
Use soft assignment to clusters: add counts to neighboring bins

![](https://i.imgur.com/a4WbGkI.png)

## Other variants
BoVW is only about counting the number of local descriptors

## VLAD: Vector of Locally Aggregated Descriptors
![](https://i.imgur.com/fClfKFr.png)
![](https://i.imgur.com/Sr75DHs.jpg)

## Fisher vector
![](https://i.imgur.com/cMgDD7Y.png)

# IR evaluation
## How to evaluate a retrieval system?
We need a set of queries for which we know the expected results: the "ground truth", aka "targets" or "gold standard"

## Precision and recall
Used to measure the balance between:
- Returning many results, hence a lot of the relevant results present in the database, but also a lot of noise
- Returning very few results, leading to less noise, but also fewer relevant results

:::info
Precision ($P$) is the fraction of retrieved documents that are relevant:
![](https://i.imgur.com/IhJjsv6.png)
:::

:::info
Recall ($R$) is the fraction of relevant documents that are retrieved:
![](https://i.imgur.com/r1Czp0V.png)
:::

![](https://i.imgur.com/8isRfsq.png)

## F-measure
:::info
The F-measure is the weighted harmonic mean of precision and recall
![](https://i.imgur.com/haBncXm.png)
:::

## How to evaluate a **ranked** retrieval system?
When results are ordered, more measures are available. Common useful measures are:
- Precision-recall graph
- ROC graph and the area under it (AUC)

### Precision-recall graph
![](https://i.imgur.com/kHO9icg.png)

#### Mean average precision
![](https://i.imgur.com/v9hTxL6.png)

#### Example: Compute the AP for a given query
For this query and the following results, plot the precision/recall graph

![](https://i.imgur.com/XHvHYC9.png)

:::success
Results 1, 3 and 9 are the relevant ones here
:::

![](https://i.imgur.com/rpHF2Lx.png)
![](https://i.imgur.com/I13Ey9Q.png)

*Is the first result relevant?* Yes
- Compute the current precision: 1 relevant / 1 retrieved = 1
- Compute the current recall: 1 of the 3 relevant results retrieved = 1/3

Repeat for each result

![](https://i.imgur.com/cSj8Sl2.png)

:::danger
This builds the sawtooth precision/recall graph
:::

![](https://i.imgur.com/Qy0tjaT.png)

:::warning
Some libraries keep the higher (interpolated) precision values, **which is misleading**
:::

Case 2: what if $\vert e_i \vert = 4$?

![](https://i.imgur.com/aRmeh6t.png)

:::danger
What we want is the area under the curve
:::
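A small sketch of the AP computation for the example above, assuming the query returned ten results with the relevant ones at ranks 1, 3 and 9 (the result count and the function name are ours; the `n_relevant` argument covers Case 2, where some relevant items are never retrieved):

```python
def average_precision(ranked_relevance, n_relevant=None):
    """Mean of the precision values at each rank where a relevant result appears.

    n_relevant: total number of relevant items in the ground truth; items that
    are never retrieved (Case 2) still count in the denominator.
    """
    if n_relevant is None:
        n_relevant = sum(ranked_relevance)
    total, seen = 0.0, 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            seen += 1
            total += seen / rank  # precision at this rank
    return total / n_relevant if n_relevant else 0.0

relevance = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]        # ranks 1, 3, 9 are relevant
print(average_precision(relevance))                # (1/1 + 2/3 + 3/9) / 3 = 2/3
print(average_precision(relevance, n_relevant=4))  # Case 2: same hits, |e_i| = 4
```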