# Documentation of eYSIP

## Image Segmentation

* Point, Line, Edge Detection
* Thresholding - various methods and their advantages
* Contours and hierarchy, use and application
* Region-based Segmentation

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as image objects). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze.

![](https://i.imgur.com/Pk9nyYa.png)

Segmentation algorithms are generally based on two basic properties of gray-level values:
1. Discontinuity - isolated points, lines and edges of an image.
2. Similarity - thresholding, region growing, region splitting and merging.

### Point detection

Since a second-order derivative is very sensitive to sudden changes, we use it to detect isolated points.

* A point is detected at the location p(i,j) on which the mask is centered if |R| > T, where T is a non-negative threshold and R is the response obtained with the following mask.

![](https://i.imgur.com/8LZWWUu.png)

* The idea is that the gray level of an isolated point will be quite different from the gray level of its neighbors.

![](https://i.imgur.com/1BzuKFv.png)

### Line detection

We can also use the Laplacian for line detection, since it is sensitive to sudden changes and thin lines.

* Note that since the second derivative changes its sign on a line, it creates a "double line effect" that must be handled.
* The second derivative can produce negative values, so the results need to be scaled.
* If we want to detect lines in a certain direction only, we can use masks that emphasize that direction and are less sensitive to other directions.

![](https://i.imgur.com/jWIxsmI.png)

### Edge detection

Most of the shape information of an image is enclosed in edges. We first detect these edges using filters and then enhance the areas of the image that contain edges; the sharpness of the image increases and the image becomes clearer. The properties of edges are:

* They locate sharp changes in the intensity function.
* Edges are pixels where brightness changes abruptly.
* A change of the image function can be described by a gradient that points in the direction of the largest growth of the image function.
* An edge is a property attached to an individual pixel and is calculated from the behavior of the image function in a neighborhood of that pixel.
* The magnitude of the first derivative detects the presence of an edge.
* The sign of the second derivative determines whether the edge pixel lies on the dark side or the light side.

![](https://i.imgur.com/xu78Uvb.png)

Three different edge types are observed:

* Step edge - transition of intensity level over 1 pixel only in the ideal case, or a few pixels in practice.
* Ramp edge - a slow and gradual transition.
* Roof edge - a transition to a different intensity and back; a kind of spread-out line.

**Derivatives**: We use derivatives (first and second) to discover discontinuities.

* If we observe a line within an image, we can consider it as a one-dimensional function f(x). We define the first derivative simply as the difference between two adjacent pixels.
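As a quick illustration of the point-detection idea above, here is a minimal Python/OpenCV sketch that applies a 3x3 point-detection mask and keeps pixels where |R| > T. The input path and the threshold value are placeholder assumptions, not part of the original write-up.

```python
import cv2
import numpy as np

# Load an 8-bit grayscale image (placeholder path).
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# 3x3 point-detection mask: centre weight 8, all neighbours -1, so the
# response is large only where a pixel differs strongly from its surroundings.
point_mask = np.array([[-1, -1, -1],
                       [-1,  8, -1],
                       [-1, -1, -1]], dtype=np.float32)

# Use a float output depth so negative responses are not clipped.
response = cv2.filter2D(img, cv2.CV_32F, point_mask)

# A point is declared wherever |R| > T; T here is an arbitrary example value.
T = 200
points = (np.abs(response) > T).astype(np.uint8) * 255
cv2.imwrite("isolated_points.png", points)
```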
![](https://i.imgur.com/P995UQi.png)

**Derivative Properties:**

* The first derivative generally produces thicker edges in an image.
* The second derivative has a very strong response to fine details and noise.
* The sign of the second derivative can be used to determine the transition direction.

![](https://i.imgur.com/XpkkUk8.png)

The gradient vector has the important geometrical property that it points in the direction of the greatest rate of change of f at location (x,y). The magnitude of the gradient is the value of the rate of change in the direction of the gradient vector.

![](https://i.imgur.com/djAUWYy.png)

The direction of the gradient with respect to the x-axis:

![](https://i.imgur.com/vbtZ0YZ.png)

* Instead of computing the partial derivatives at every pixel location in the image, we use an approximation over a 3x3 neighborhood centered about a point.
* For gy: subtract the pixel in the left column from the pixel in the right column.
* For gx: subtract the pixel in the top row from the pixel in the bottom row.

Here are some of the masks for edge detection that we will discuss:

* Prewitt Operator
* Sobel Operator
* Laplacian Operator
* Canny Operator

**Sobel Operator:** The Sobel operator is one of the most commonly used edge detectors. It is based on convolving the image with a small, separable, integer-valued filter in the horizontal and vertical directions and is therefore relatively inexpensive in terms of computation. The Sobel edge-enhancement filter has the advantage of providing differentiation (which gives the edge response) and smoothing (which reduces noise) concurrently.

![](https://i.imgur.com/SRM75xd.png)

OUTPUT:

![](https://i.imgur.com/rwCjTBa.png)

-------------------------------------------

**Prewitt Operator:** The Prewitt operator is similar to the Sobel operator and is used for detecting vertical and horizontal edges in images. However, unlike the Sobel, this operator does not place any emphasis on the pixels that are closer to the center of the mask.

![](https://i.imgur.com/BSJT2Bp.png)

Output:

![](https://i.imgur.com/OG7TaEU.png)

-------------------------------------------------

**Laplacian Operator:** The Laplacian is somewhat different from the methods discussed so far. Unlike the Sobel and Prewitt edge detectors, the Laplacian edge detector uses only one kernel. It calculates second-order derivatives in a single pass. Two commonly used small kernels are:

![](https://i.imgur.com/XIu6ZKS.png)

Because these masks approximate a second-derivative measurement on the image, they are very sensitive to noise. To correct this, the image is often Gaussian-smoothed before applying the Laplacian filter. We can also convolve the Gaussian mask with the Laplacian mask and apply the result to the image in one pass.

OUTPUT:

![](https://i.imgur.com/Wrup386.png)

----------------------------------------

**Canny Operator:** The Canny edge detector is probably the most commonly used and most effective method. It is a much more complex edge-detection method than the ones described above.

* Smooth the image with a Gaussian filter to reduce noise.
* Compute the gradient using one of the gradient operators (Sobel or Prewitt).
* Extract edge points: non-maximum suppression.
* Linking and thresholding: hysteresis.

The first two steps are straightforward; note that in the second step we also compute the orientation of the gradients:

> theta = arctan(Gy / Gx)

where Gx and Gy are the gradients in the x and y directions respectively.
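To make the gradient and Canny steps above concrete, here is a small Python/OpenCV sketch. The file name and the hysteresis thresholds (100, 200) are placeholder assumptions.

```python
import cv2
import numpy as np

# Grayscale input (placeholder path).
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# Sobel gradients in x and y; CV_64F keeps negative slopes.
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)

# Gradient magnitude and orientation (theta = arctan(Gy / Gx)).
magnitude = np.sqrt(gx**2 + gy**2)
theta = np.arctan2(gy, gx)

# Canny bundles smoothing, gradients, non-maximum suppression and
# hysteresis thresholding; 100 and 200 are example hysteresis thresholds.
edges = cv2.Canny(img, 100, 200)

cv2.imwrite("sobel_magnitude.png", np.uint8(np.clip(magnitude, 0, 255)))
cv2.imwrite("canny_edges.png", edges)
```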
![](https://i.imgur.com/xIp5feM.png)

-------------------------------------

**Roberts Operator:** This is also one of the simpler methods. Two small filters of size 2 x 2 are used for edge detection.

![](https://i.imgur.com/DnloOtw.png)

And these are the results of those two small filters.

![](https://i.imgur.com/MVxozLF.png)

---------------------------------------------------

## Thresholding

* Thresholding is one of the most important approaches to image segmentation.
* If background and object pixels have gray levels grouped into two dominant modes, they can easily be separated with a threshold.

![](https://i.imgur.com/ivTbEYb.png)

* Thresholding may be viewed as an operation that involves tests against a function T of the form `T = T[x, y, p(x,y), f(x,y)]`, where f(x,y) is the gray level of point (x,y), and p(x,y) denotes some local property of this point, such as the average gray level of a neighborhood centered on (x,y).
* Special cases - if T depends on:
  1. f(x,y) only - global threshold
  2. Both f(x,y) & p(x,y) - local threshold
  3. (x,y) - dynamic threshold
* Multilevel thresholding is in general less reliable, as it is difficult to establish effective thresholds that isolate the regions of interest.

### Simple Thresholding

The simplest approach to segmenting an image is thresholding.

`If f(x, y) > T then f(x, y) = 0 else f(x, y) = 255`

![](https://i.imgur.com/LwQxPOj.png)

**Drawbacks:**

* Establishing a threshold is not trivial.
* It uses a fixed threshold for all the pixels.
* It works only if the intensity histogram of the input image contains neatly separated peaks corresponding to the desired object(s) and background(s).
* It cannot deal with images containing, for example, a strong illumination gradient.

### Adaptive Thresholding

Adaptive thresholding changes the threshold dynamically over the image, to handle changing lighting conditions, e.g. those occurring as a result of a strong illumination gradient or shadows.

* Local adaptive thresholding selects the threshold based on an analysis of the local neighbourhood area.
* The assumption is that smaller image regions are more likely to have approximately uniform illumination.
* This allows thresholding of an image whose global intensity histogram doesn't contain distinctive peaks.

![](https://i.imgur.com/y5c43pi.png)

* Typical methods are:
  1. Adaptive Mean thresholding: the threshold value is the mean of the neighbourhood area.
  2. Adaptive Gaussian thresholding: the threshold value is the weighted sum of the neighbourhood values, where the weights form a Gaussian window.

### Otsu's Binarization

In global thresholding, we used an arbitrary threshold value. So how can we know whether the value we selected is good or not? The answer is trial and error. But consider a bimodal image (in simple words, an image whose histogram has two peaks). For such an image, we can approximately take a value in the middle of those peaks as the threshold value. That is what Otsu's binarization does: it automatically calculates a threshold value from the image histogram of a bimodal image. (For images which are not bimodal, the binarization won't be accurate.) In effect, starting from a candidate threshold, an iterative procedure adjusts the threshold so as to minimize the weighted within-class variance of the two resulting pixel classes.
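The thresholding variants described above map directly onto OpenCV calls. The sketch below assumes a placeholder input file and example values for the fixed threshold, block size and offset constant.

```python
import cv2

# Grayscale input (placeholder path).
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# Simple (global) thresholding with a fixed value T = 127.
_, th_simple = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

# Adaptive mean and adaptive Gaussian thresholding over 11x11 neighbourhoods.
th_mean = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                cv2.THRESH_BINARY, 11, 2)
th_gauss = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 11, 2)

# Otsu's binarization: pass 0 as the threshold and let OpenCV pick it
# from the histogram; the chosen value is returned as otsu_T.
otsu_T, th_otsu = cv2.threshold(img, 0, 255,
                                cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print("Threshold selected by Otsu:", otsu_T)
```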
![](https://i.imgur.com/Jg5DvWK.png)

-------------------------------------------------

## Contours

Contours can be explained simply as a curve joining all the continuous points (along a boundary) having the same color or intensity. Contours are a useful tool for shape analysis and object detection and recognition. So how do we know that all the

-------------------------------------------------

## Region-based Segmentation

In the previous methods, we partitioned an image into regions by finding boundaries between regions based on intensity discontinuities.

* Here, segmentation is accomplished via thresholds based on the distribution of pixel properties, such as intensity or color.
* Basic formulation: let R represent the entire image, which is partitioned into subregions R1, R2, ..., Rn such that

![](https://i.imgur.com/7rd7M2T.png)

* Physical meaning of the formulation:
  1. The segmentation must be complete, i.e. every point must be in a region.
  2. Points in a region must be connected.
  3. The regions must be disjoint.
  4. It deals with the properties that must be satisfied by the pixels in a segmented region - for example, P(Ri) = true if all pixels in Ri have the same intensity.
  5. Regions Ri and Rj are different in the sense of the predicate P.

### Region growing by pixel aggregation

* Region growing is a procedure that groups pixels or subregions into larger regions.
* Pixel aggregation starts with a set of "seed" points and grows regions from them by appending to each seed point those neighboring pixels that have similar properties, such as gray level, texture and color (a minimal code sketch of this idea appears after the split-and-merge example below).

![](https://i.imgur.com/MMjxLcB.png)

### Region splitting and merging

* The idea is to subdivide an image initially into a set of arbitrary, disjoint regions and then merge and/or split the regions in an attempt to satisfy the conditions stated above.
* A split-and-merge algorithm is summarized by the following procedure, in which, at each step, we:
  1. split into four disjoint quadrants any region Ri for which P(Ri) = false;
  2. merge any adjacent regions Rj and Rk for which P(Rj ∪ Rk) = true; and
  3. stop when no further merging or splitting is possible.

![](https://i.imgur.com/25R7u8K.png)

* Example

![](https://i.imgur.com/IaA9G8O.png)

---> The entire image is split into 4 quadrants.
---> Only the top-left region satisfies the predicate, so it is not changed, while the other 3 quadrants are split into subquadrants.
---> At this point several regions can be merged, with the exception of the 2 subquadrants that include the lower part of the object; these do not satisfy the predicate and must be split further.

![](https://i.imgur.com/jW1YmRO.png)
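As referenced in the region-growing bullet above, here is a minimal sketch of seeded region growing with a 4-connected neighbourhood and a gray-level tolerance. The input path, seed location and tolerance are placeholder assumptions.

```python
import cv2
import numpy as np
from collections import deque

def region_grow(gray, seed, tol=10):
    """Grow a region from `seed` (row, col), adding 4-connected
    neighbours whose gray level is within `tol` of the seed value."""
    h, w = gray.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    seed_val = int(gray[seed])
    queue = deque([seed])
    mask[seed] = 255
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and mask[nr, nc] == 0:
                if abs(int(gray[nr, nc]) - seed_val) <= tol:
                    mask[nr, nc] = 255
                    queue.append((nr, nc))
    return mask

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
region = region_grow(img, seed=(100, 100), tol=15)  # example seed and tolerance
cv2.imwrite("region.png", region)
```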
* Image segmentation is a preliminary step in most automatic pictorial pattern-recognition and scene-analysis problems.
* The choice of one segmentation technique over another is dictated mostly by the peculiar characteristics of the problem being considered.

-------------------------------------------

Resources:
[link](https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html)
[link](https://towardsdatascience.com/image-processing-class-egbe443-5-edge-and-contour-d5d410f4483c)

-------------------------------------------

## Feature detection, matching, alignment and Image stitching

* Feature detectors, descriptors, matching, tracking methods
* 2D and 3D feature-based alignment algorithms, examples of image stitching in panoramas
* Pose estimation algorithms

Most of you will have played jigsaw puzzle games. You get a lot of small pieces of an image and you need to assemble them correctly to form the big original image. The question is, how do you do it? What about projecting the same idea onto a computer program so that the computer can play jigsaw puzzles? If the computer can play jigsaw puzzles, why can't we give it many real-life images of a natural scene and tell it to stitch all those images into one big image? If the computer can stitch several natural images into one, what about giving it many pictures of a building or any structure and telling it to create a 3D model out of them? Well, the questions and imaginations continue. But it all depends on the most basic question: how do you play jigsaw puzzles? How do you arrange lots of scrambled image pieces into one big image? How can you stitch many natural images into a single image?

The answer is that we look for specific patterns or specific features which are unique, can be easily tracked, and can be easily compared. If we try to define such a feature, we may find it difficult to express in words, but we know what they are. If someone asks you to point out one good feature which can be compared across several images, you can point one out. That is why even small children can play these games. We search for these features in an image, find them, look for the same features in other images, and align them. That's it. (In a jigsaw puzzle, we look more at the continuity between different pieces.) All these abilities are present in us inherently.

Feature detection and matching is an important task in many computer vision applications, such as structure-from-motion, image retrieval, object detection, and more.

### Application Of Feature Detection And Matching

* Automated object tracking
* Point matching for computing disparity
* Stereo calibration (estimation of the fundamental matrix)
* Motion-based segmentation
* Recognition
* 3D object reconstruction
* Robot navigation
* Image retrieval and indexing

### Feature

A feature is a piece of information which is relevant for solving the computational task related to a certain application. Features may be specific structures in the image such as points, edges or objects. Features may also be the result of a general neighborhood operation or feature detection applied to the image.

**Types of image features**

* Edges
* Corners / interest points
* Blobs / regions of interest points
* Ridges

![](https://i.imgur.com/lePUZqj.png)

Wherever you move the blue patch it looks the same. The black patch lies on an edge: if you move it in the vertical direction (i.e. along the gradient) it changes, but moved along the edge (parallel to the edge) it looks the same. The red patch is on a corner.
Wherever you move the patch, it looks different, which means it is unique. So basically, corners are considered to be good features in an image. (Not just corners; in some cases blobs are considered good features.)

A **feature point** is a point which is expressive in texture. An interest point is a point at which the direction of the boundary of the object changes abruptly, or an intersection point between two or more edge segments.

**Possible Approaches**

* Based on the change in brightness (derivative).
* Based on boundary extraction (edge detection and curvature analysis).

### Descriptor

Descriptors encode interesting information into a series of numbers and act as a sort of numerical "fingerprint" that can be used to differentiate one feature from another. This information should be invariant under image transformations, so we can find the feature again even if the image is transformed in some way. After detecting interest points we go on to compute a descriptor for every one of them. Descriptors can be categorized into two classes:

1. Local descriptor: a compact representation of a point's local neighborhood. Local descriptors capture shape and appearance only in a local neighborhood around a point and thus are very suitable for representing it for matching.
2. Global descriptor: a global descriptor describes the whole image. Global descriptors are generally not very robust, as a change in part of the image affects the resulting descriptor and may cause matching to fail.

## Harris Corner Detector

The Harris Corner Detector is a corner-detection operator that is commonly used in computer vision algorithms to extract corners and infer features of an image.

![](https://i.imgur.com/2ETxlPb.png)

A corner is a point whose local neighborhood stands in two dominant and different edge directions. In other words, a corner can be interpreted as the junction of two edges, where an edge is a sudden change in image brightness. Corners are important features in the image, and they are generally termed interest points, which are invariant to translation, rotation and illumination. Although corners are only a small percentage of the image, they contain the most important features for restoring image information.

The idea is to consider a small window around each pixel p in an image. We want to identify all such pixel windows that are unique. Uniqueness can be measured by shifting each window by a small amount in a given direction and measuring the amount of change that occurs in the pixel values. Harris basically finds the difference in intensity for a displacement of (u,v) in all directions. This is expressed as below:

![](https://i.imgur.com/TpPlMdk.png)

The window function is either a rectangular window or a Gaussian window which gives weights to the pixels underneath. We have to maximize this function E(u,v) for corner detection, which means we have to maximize the second term. Applying a Taylor expansion to the above equation and using some mathematical steps (please refer to any standard textbook for the full derivation), we get the final equation as:

![](https://i.imgur.com/OctndiA.png)

Ix and Iy are the image derivatives in the x and y directions respectively. A score, R, is calculated for each window:

![](https://i.imgur.com/qXfgktr.png)

λ1 and λ2 are the eigenvalues of M. The magnitudes of these eigenvalues decide whether a region is a corner, an edge, or flat.
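As an illustration, the Harris response described above can be computed with OpenCV's cornerHarris. The input path, block size, Sobel aperture, k value and the 0.01 fraction below are example assumptions.

```python
import cv2
import numpy as np

img = cv2.imread("input.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# cornerHarris expects a float32 image; blockSize is the neighbourhood
# considered, ksize the Sobel aperture, k the Harris free parameter.
gray32 = np.float32(gray)
R = cv2.cornerHarris(gray32, blockSize=2, ksize=3, k=0.04)

# Mark pixels whose score R is above a fraction of the maximum response.
img[R > 0.01 * R.max()] = [0, 0, 255]
cv2.imwrite("harris_corners.png", img)
```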
## SIFT (Scale-Invariant Feature Transform)

A corner may not remain a corner if the image is scaled. For example, check the simple image below. A corner in a small image, viewed within a small window, becomes flat when it is zoomed in within the same window. So the Harris corner detector is not scale invariant.

![](https://i.imgur.com/oiwKGdO.png)

There are mainly five steps involved in the SIFT algorithm. We will see them one by one.

### 1. Scale-space Extrema Detection

* **Scale space**: The scale space of an image is a function L(x,y,σ) that is produced from the convolution of a Gaussian kernel (blurring) at different scales with the input image. The scale space is separated into octaves, and the number of octaves and scales depends on the size of the original image. So we generate several octaves of the original image, where each octave's image size is half of the previous one.
* **Blurring**: Within an octave, images are progressively blurred using the Gaussian blur operator.
* **DoG (Difference of Gaussians)**: The Difference of Gaussians is obtained as the difference of the Gaussian blurrings of an image with two different σ values, say σ and kσ. This process is done for the different octaves of the image in the Gaussian pyramid. It is represented in the image below:

![](https://i.imgur.com/qII5aeB.png)

* Once the DoG is found, the images are searched for local extrema over scale and space. For example, one pixel in an image is compared with its 8 neighbours as well as 9 pixels in the next scale and 9 pixels in the previous scale. If it is a local extremum, it is a potential keypoint. It basically means that the keypoint is best represented at that scale. This is shown in the image below:

![](https://i.imgur.com/wlvDC0L.jpg)

### 2. Keypoint Localization

Scale-space extrema detection produces too many keypoint candidates, some of which are unstable. The next step in the algorithm is to perform a detailed fit to the nearby data for accurate location, scale, and ratio of principal curvatures. This information allows points to be rejected that have low contrast (and are therefore sensitive to noise) or are poorly localized along an edge. A Taylor series expansion of the scale space is used to get a more accurate location of the extremum, and if the intensity at this extremum is less than a threshold value (0.03, for example), it is rejected. The DoG has a higher response for edges, so edges also need to be removed; a 2x2 Hessian matrix (H) is used to compute the principal curvatures.

![](https://i.imgur.com/QuMJ4tc.png)

### 3. Orientation Assignment

Now an orientation is assigned to each keypoint to achieve invariance to image rotation. A neighbourhood is taken around the keypoint location depending on the scale, and the gradient magnitude and direction are calculated in that region. An orientation histogram with 36 bins covering 360 degrees is created (weighted by gradient magnitude and by a Gaussian-weighted circular window with σ equal to 1.5 times the scale of the keypoint). The highest peak in the histogram is taken, and any peak above 80% of it is also considered when calculating the orientation. This creates keypoints with the same location and scale but different directions, which contributes to the stability of matching.

![](https://i.imgur.com/8cqSgCe.png)

### 4. Keypoint Descriptor

Now the keypoint descriptor is created. A 16x16 neighbourhood around the keypoint is taken and divided into 16 sub-blocks of 4x4 size.

![](https://i.imgur.com/JoHX3hy.png)

For each sub-block, an 8-bin orientation histogram is created.
![](https://i.imgur.com/7qopNi6.png)

So a total of 128 bin values are available. They are represented as a vector to form the keypoint descriptor. In addition to this, several measures are taken to achieve robustness against illumination changes, rotation, etc.

### 5. Keypoint Matching

Keypoints between two images are matched by identifying their nearest neighbours. In some cases, however, the second-closest match may be very near to the first; this may happen due to noise or other reasons. In that case, the ratio of the closest distance to the second-closest distance is taken, and if it is greater than 0.8 the match is rejected. This eliminates around 90% of the false matches while discarding only about 5% of the correct matches.

-----------------------------------------

## SURF (Speeded-Up Robust Features)

In SIFT, Lowe approximated the Laplacian of Gaussian with the Difference of Gaussians for finding the scale space. **SURF** goes a little further and approximates the LoG with box filters. The image below shows a demonstration of such an approximation. One big advantage of this approximation is that convolution with a box filter can be calculated easily with the help of integral images, and it can be done in parallel for different scales. SURF also relies on the determinant of the Hessian matrix for both scale and location.

![](https://i.imgur.com/nXBdE8T.png)

### Feature Extraction

An **integral image** is a quick and effective way of calculating the sum of values (pixel values) in a given image, or a rectangular subset of it. It allows fast computation of box-type convolution filters. The entry of an integral image I_∑(x) at a location x = (x,y)ᵀ represents the sum of all pixels in the input image I within the rectangular region formed by the origin and x.

![](https://i.imgur.com/zL4BN6s.png)

**Hessian matrix-based interest points**: SURF uses the Hessian matrix because of its good performance in computation time and accuracy. Rather than using a different measure for selecting the location and the scale (as the Hessian-Laplace detector does), SURF relies on the determinant of the Hessian matrix for both. Given a pixel, the Hessian of this pixel is something like:

![](https://i.imgur.com/f03pUi1.png)

To adapt to any scale, the image is filtered by a Gaussian kernel, so given a point X = (x, y), the Hessian matrix H(X, σ) at X at scale σ is defined as:

![](https://i.imgur.com/oY2lI2x.png)

where Lxx(X, σ) is the convolution of the Gaussian second-order derivative with the image I at point X, and similarly for Lxy(X, σ) and Lyy(X, σ). In order to calculate the determinant of the Hessian matrix, we first need to apply convolution with a Gaussian kernel, then the second-order derivative. After Lowe's success with LoG approximations (SIFT), SURF pushes the approximation (both the convolution and the second-order derivative) even further with box filters.

![](https://i.imgur.com/4vAkw2O.png)

The 9 × 9 box filters in the images above are approximations of Gaussian second-order derivatives with σ = 1.2. We denote these approximations by Dxx, Dyy, and Dxy. Now we can represent the determinant of the (approximated) Hessian as:

![](https://i.imgur.com/qI1vT5X.png)

### Feature Description

**Orientation Assignment**: In order to be invariant to rotation, SURF tries to identify a reproducible orientation for the interest points.
To achieve this:

* SURF first calculates the Haar-wavelet responses in the x and y directions in a circular neighborhood of radius 6s around the keypoint, where s is the scale at which the keypoint was detected. The sampling step is scale dependent and chosen to be s, and the wavelet responses are computed at that current scale s. Accordingly, at high scales the size of the wavelets is big, so integral images are used again for fast filtering.
* Then we calculate the sum of the vertical and horizontal wavelet responses in a scanning area, change the scanning orientation (add π/3), and re-calculate, until we find the orientation with the largest sum value; this orientation is taken as the main orientation of the feature descriptor.

![](https://i.imgur.com/HzD85SC.png)

**Descriptor Components**:

* The first step consists of constructing a square region centered around the keypoint and oriented along the orientation we obtained above. The size of this window is 20s.
* Then the region is split up regularly into smaller 4 × 4 square sub-regions. For each sub-region, we compute a few simple features at 5 × 5 regularly spaced sample points. For simplicity, we call dx the Haar wavelet response in the horizontal direction and dy the Haar wavelet response in the vertical direction (filter size 2s). To increase the robustness towards geometric deformations and localization errors, the responses dx and dy are first weighted with a Gaussian (σ = 3.3s) centered at the keypoint.

![](https://i.imgur.com/pXkCKb4.png)

Then, the wavelet responses dx and dy are summed up over each sub-region and form a first set of entries in the feature vector. In order to bring in information about the polarity of the intensity changes, we also extract the sums of the absolute values of the responses, |dx| and |dy|. Hence, each sub-region has a four-dimensional descriptor vector v for its underlying intensity structure, V = (∑dx, ∑dy, ∑|dx|, ∑|dy|). This results in a descriptor vector of length 64 over all 4 × 4 sub-regions. (In SIFT the descriptor is a 128-D vector, so this is part of the reason SURF is faster than SIFT.)

## FAST

Features from Accelerated Segment Test (FAST) is a corner-detection method which can be used to extract feature points and later to track and map objects in many computer vision tasks. The FAST corner detector is very suitable for real-time video processing applications because of its high-speed performance.

### Feature Detection using FAST

* Select a pixel p in the image which is to be identified as an interest point or not. Let its intensity be Ip.
* Select an appropriate threshold value t.
* Consider a circle of 16 pixels around the pixel under test (see the image below).

![](https://i.imgur.com/k27XR2x.png)

* Now the pixel p is a corner if there exists a set of n contiguous pixels in the circle (of 16 pixels) which are all brighter than Ip + t, or all darker than Ip − t (shown as the white dashed lines in the image above). n was chosen to be 12.
* To make the algorithm fast, first compare the intensity of pixels 1, 5, 9 and 13 of the circle with Ip. As evident from the figure above, at least three of these four pixels should satisfy the threshold criterion for the interest point to exist.
* If at least three of the four pixel values I1, I5, I9, I13 are not above Ip + t or below Ip − t, then p is not an interest point (corner), and we reject the pixel p as a possible interest point.
* Otherwise, if at least three of these pixels are above Ip + t or below Ip − t, then check all 16 pixels and see whether 12 contiguous pixels satisfy the criterion.
* Repeat the procedure for all the pixels in the image.

There are a few limitations to the algorithm. First, for n < 12 the algorithm does not work very well in all cases, because the number of interest points detected becomes very high. Second, the order in which the 16 pixels are queried determines the speed of the algorithm. A machine learning approach has been added to the algorithm to deal with these issues.

### Machine Learning Approach

* Select a set of images for training (preferably from the target application domain).
* Run the FAST algorithm on every image to find feature points.
* For every feature point, store the 16 pixels around it as a vector. Do this for all the images to get a feature vector P.
* Each pixel (say x) in these 16 pixels can have one of the following three states:

![](https://i.imgur.com/wayVriG.png)

* Depending on these states, the feature vector P is subdivided into 3 subsets: Pd, Ps, Pb.
* Define a variable Kp which is true if p is an interest point and false if p is not an interest point.
* Use the ID3 algorithm (a decision-tree classifier) to query each subset using the variable Kp for the knowledge about the true class.
* The ID3 algorithm works on the principle of entropy minimization: query the 16 pixels in such a way that the true class (interest point or not) is found with the minimum number of queries. In other words, select the pixel x which carries the most information about the pixel p.

![](https://i.imgur.com/NO50L8w.png)

* This is applied recursively to all the subsets until their entropy is zero.
* The decision tree so created is used for fast detection in other images.

### Non-maximal Suppression

Detecting multiple interest points in adjacent locations is another problem. It is solved by using non-maximum suppression.

* Compute a score function V for all the detected feature points. V is the sum of the absolute differences between p and the 16 surrounding pixel values.
* Consider two adjacent keypoints and compute their V values.
* Discard the one with the lower V value.

----------------------------------

## BRIEF (Binary Robust Independent Elementary Features)

BRIEF is very fast both to build and to match. BRIEF easily outperforms other fast descriptors such as SURF and SIFT in terms of speed, and in terms of recognition rate in many cases.

![](https://i.imgur.com/ehC4QLc.png)

It provides a shortcut to obtain binary strings directly without computing full descriptors. It takes a smoothed image patch and selects a set of n<sub>d</sub> (x,y) location pairs in a unique way (explained in the paper). Then some pixel intensity comparisons are done on these location pairs. For example, let the first location pair be p and q. If I(p) < I(q), the result is 1, otherwise it is 0. This is applied to all the n<sub>d</sub> location pairs to get an n<sub>d</sub>-dimensional bit string. This n<sub>d</sub> can be 128, 256 or 512. OpenCV supports all of these, but by default it is 256 (OpenCV represents it in bytes, so the values will be 16, 32 and 64). Once you get this, you can use the Hamming distance to match these descriptors.

One important point is that BRIEF is only a feature descriptor; it doesn't provide any method to find the features. So you will have to use another feature detector such as FAST, SIFT or SURF (a small sketch combining FAST with BRIEF follows below).
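The sketch below pairs the FAST detector with the BRIEF descriptor, as described above. It assumes the opencv-contrib-python package is installed (BRIEF lives in the xfeatures2d module), and the input path and FAST threshold are placeholder values.

```python
import cv2

# Requires opencv-contrib-python, since BRIEF lives in xfeatures2d.
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints with FAST (threshold 25 is just an example value).
fast = cv2.FastFeatureDetector_create(threshold=25)
keypoints = fast.detect(img, None)

# Describe them with BRIEF (32 bytes = 256-bit strings by default).
brief = cv2.xfeatures2d.BriefDescriptorExtractor_create()
keypoints, descriptors = brief.compute(img, keypoints)

print("Keypoints:", len(keypoints))
print("Descriptor size (bytes):", brief.descriptorSize())
```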
The paper recommends using CenSurE, which is a fast detector, and BRIEF works even slightly better for CenSurE points than for SURF points. In short, BRIEF is a faster method for feature descriptor calculation and matching. It also provides a high recognition rate unless there is large in-plane rotation.

--------------------------------

## ORB (Oriented FAST and Rotated BRIEF)

ORB builds on the well-known FAST keypoint detector and the BRIEF descriptor. Both of these techniques are attractive because of their good performance and low cost.

**FAST** (Features from Accelerated Segment Test) is used to detect features in the provided image. ORB also uses a pyramid to produce multiscale features. FAST does not compute the orientation or descriptors of the features, and this is where BRIEF comes into play.

ORB uses **BRIEF** descriptors, but since BRIEF performs poorly under rotation, ORB steers BRIEF according to the orientation of the keypoints: using the orientation of the patch, its rotation matrix is found and BRIEF is rotated to get the rotated version.

ORB is an efficient alternative to the SIFT and SURF algorithms for feature extraction, in terms of computation cost, matching performance, and mainly patents. ORB was conceived mainly because SIFT and SURF are patented algorithms; ORB, however, is free to use.

Paper link: http://www.willowgarage.com/sites/default/files/orb_final.pdf
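To tie the pieces together, here is a minimal sketch of ORB detection plus Hamming-distance matching between two overlapping images. The file names and the feature count are placeholder assumptions.

```python
import cv2

# Two overlapping views (placeholder file names).
img1 = cv2.imread("img1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("img2.png", cv2.IMREAD_GRAYSCALE)

# ORB: FAST keypoints + rotation-aware BRIEF descriptors.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors are compared with the Hamming distance;
# crossCheck keeps only mutually best matches.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)

# Draw the 30 best matches for a quick visual check.
vis = cv2.drawMatches(img1, kp1, img2, kp2, matches[:30], None,
                      flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)
cv2.imwrite("orb_matches.png", vis)
```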