---
title: Computer Vision Lecture Notes
tags: Computer Vision
---
# Computer Vision
<!-- Put the link to this slide here so people can follow -->
- Image Processing Basics
- Segmentation
- Local Features & Matching
- Object Recognition and Categorization
- Deep Learning
- 3D Reconstruction
---
## Image Processing Basics
- Image Formation
- Linear Filters
- Edge & Structure Extraction
- Color
---
### Linear Filters
- **Form a new image whose pixels are a weighted sum of original pixel values.**
- Application: Smoothing
- Common Types of Noise: (Noise is "Independent, Identically Distributed")
- Salt & Pepper Noise: random black or white pixels
- Impulse Noise: random white pixels
- Gaussian Noise: variations in intensity drawn from a Gaussian (normal) distribution. (most common)
- Assumption:
- Expect pixels to be like their neighbors.
- Expect noise at each pixel to be i.i.d.
---
#### Correlation Filtering: (Cross Correlation: <img src="https://i.imgur.com/8XpCfzo.png" width="100"/>)
- Average Window Size: (2k+1) x (2k+1)
<img src="https://i.imgur.com/WyDgf99.png" width="500"/>
- Now generalize to different weight:
<img src="https://i.imgur.com/ASGEqpl.png" width="500"/>
---
#### Convolution
1. Flip the filter **in both dimensions** (bottom to top, right to left)
2. Then apply cross-correlation
<img src="https://i.imgur.com/Sf0MUg1.png" width="500"/>
- **If H[-u,-v] = H[u,v], then correlation = convolution.**
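To make the flip concrete, here is a minimal NumPy/SciPy sketch (the array contents are arbitrary placeholders) showing that convolution equals cross-correlation with a flipped kernel, and that the two coincide for a symmetric kernel:

```python
import numpy as np
from scipy.ndimage import correlate, convolve

image = np.random.rand(8, 8)
kernel = np.array([[0., 1., 0.],
                   [1., 2., 1.],
                   [0., 1., 0.]]) / 6.0   # symmetric: H[-u,-v] = H[u,v]

corr = correlate(image, kernel, mode='constant')   # slide the kernel, no flip
conv = convolve(image, kernel, mode='constant')    # flip both axes, then correlate

# Convolution = cross-correlation with the flipped kernel ...
assert np.allclose(conv, correlate(image, kernel[::-1, ::-1], mode='constant'))
# ... and for a symmetric kernel, correlation = convolution.
assert np.allclose(corr, conv)
```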
---
#### Gaussian Smoothing
- Gaussian kernel:
<img src="https://i.imgur.com/w7fuv16.png" width="300"/>
- Rotationally Symmetric
- Parameters that matter:
- Sigma of Gaussian: determines the extent of smoothing
<img src="https://i.imgur.com/prnZTKG.png" width="300"/>
- Size of kernel or mask
<img src="https://i.imgur.com/qMxzrtR.png" width="450"/>
- **Rule of thumb: set filter half-width (k) to about 3σ!**:
So the kernel size will be: 2*(3σ)+1
- Implementation:
<img src="https://i.imgur.com/7azREBe.png" width="500"/>
---
#### Boundary Issues
- Need to extrapolate (pad the borders first; after smoothing, remove the padding)
- Clip filter (black): extend the border with black (zero) pixels
- Wrap around: extend with pixels from the opposite edge
- Copy edge
- Reflect across edge
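These extrapolation strategies map directly onto `np.pad` modes; a small sketch (the array is a placeholder):

```python
import numpy as np

img = np.arange(9.0).reshape(3, 3)
k = 1   # filter half-width: pad by k, smooth, then crop the border off again

clip    = np.pad(img, k, mode='constant')  # clip filter (black): zeros outside
wrap    = np.pad(img, k, mode='wrap')      # wrap around: opposite-edge pixels
copy    = np.pad(img, k, mode='edge')      # copy edge
reflect = np.pad(img, k, mode='reflect')   # reflect across edge
```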
---
#### Fourier Transforms
- Sine and cosine transform to “frequency spikes”
- A Gaussian transforms to a Gaussian (Symmetric!)
- A box filter transforms to a sinc <img src="https://i.imgur.com/qoMEAWC.png" width="200"/> (Symmetric!)
- Duality: The better a function is localized in one domain, the worse it is localized in the other.
<img src="https://i.imgur.com/dBwLs1T.png" width="400"/>
- Convolving two functions in the image domain corresponds to the product of their transformed versions in the frequency domain. <img src="https://i.imgur.com/m4VLE6x.png" width="130"/>
- **Effect of Filtering**:
<img src="https://i.imgur.com/C7UUmbu.png" width="550"/>
- Sharpening Filter
<img src="https://i.imgur.com/eKRmPEy.png" width="350"/> = <img src="https://i.imgur.com/0E9Q8I2.png" width="100"/>
---
### Non-Linear Filters: Median Filter
- Replace each pixel by the median of its neighbors.
- **Removes spikes: good for impulse, salt & pepper noise.**
<img src="https://i.imgur.com/5XQaY2o.png" width="300"/>
- The Median filter is **edge preserving**.
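A minimal sketch of median filtering on synthetic salt & pepper noise (values and noise rate are arbitrary):

```python
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)
img = np.full((32, 32), 0.5)
mask = rng.random(img.shape)
img[mask < 0.025] = 0.0   # pepper
img[mask > 0.975] = 1.0   # salt

denoised = median_filter(img, size=3)   # replace each pixel by its 3x3 median
```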
---
### Multi-Scale representations
- Re-Sampling Problem:
<img src="https://i.imgur.com/EkLak4v.png" width="450"/>
- Sampling in the spatial domain is like **multiplying with a spike (comb) function.**
- In the frequency domain, this corresponds to **convolving** the spectrum with a spike (comb) function.
- Sampling & Aliasing
<img src="https://i.imgur.com/w4gczJh.png" width="500"/>
#### **Nyquist theorem**: (Exercise 1.2)
- **In order to recover a certain frequency f, we need to sample with at least 2f.**
- This corresponds to the point at which the transformed frequency spectra start to overlap (the Nyquist limit).
- Resampling with Prior Smoothing:
- Note: We cannot recover the high frequencies, but **we can avoid artifacts by smoothing before resampling.**
#### The Gaussian Pyramid: Stored Information
- Important for describing and searching an image at all scales.
- Construction: create each level from previous one
- Smooth and sample
- Smooth with Gaussians, in part because
- a Gaussian convolved with a Gaussian gives another Gaussian
- Gaussians are low-pass filters, so the representation is redundant once smoothing has been performed.
- There is no need to store smoothed images at the full original resolution.
<img src="https://i.imgur.com/K8Bwgze.png" width="500"/>
<img src="https://i.imgur.com/mzmaDES.png" width="500"/>
#### The Laplacian Pyramid
- Contain "high-pass image" and "gradient information"
- Image Compression: use it in order to store image as a much reduce bit rate
- Laplacian = Difference of Gaussian (DoG): same as laplacian filter for edge expression
- Cheap approximation – no derivatives needed.
<img src="https://i.imgur.com/b2lj9E1.png" width="500"/>
<img src="https://i.imgur.com/1uxpLwM.png" width="500"/>
#### Filters are Templates
- Applying a filter at some point can be seen as taking a dot product between the image and some vector.
- Filters respond most strongly to the patterns they look like.
<img src="https://i.imgur.com/PLLxsBO.jpg" width="500"/>
---
### Image Gradients: Derivative of Gaussian
<img src="https://i.imgur.com/tTGudXp.png" width="400"/>
- Differentiation of Image:
- <img src="https://i.imgur.com/Jo8HiyQ.png" width="400"/>
- To implement as a **Convolution**:
- Because convolution flips the filter, the kernel [-1,1] is obtained by flipping [1,-1]
- Assorted Finite Difference Filters:
<img src="https://i.imgur.com/0nZYoc3.png" width="400"/>
- The Gradient of an Image: <img src="https://i.imgur.com/O6IbiJJ.png" width="200"/>
- Points in the direction of the most rapid intensity change
<img src="https://i.imgur.com/TSu63VD.png" width="500"/>
- The Gradient Direction: <img src="https://i.imgur.com/LGdDuli.png" width="200"/>
- The Gradient Strength: <img src="https://i.imgur.com/jYMlyQF.png" width="250"/>
- Problem: Effect of Noise
<img src="https://i.imgur.com/iWiiwyu.png" width="300"/>
- Solution: Smoothing First
<img src="https://i.imgur.com/4tg6EqT.png" width="400"/>
- We can first take the derivative of the Gaussian used for smoothing <img src="https://i.imgur.com/bwdfaG6.png" width="250"/>
<img src="https://i.imgur.com/EaJHQKD.png" width="400"/>
- **Advantage? We no longer have to differentiate the image again** (we save the derivative operation entirely) => now we just convolve with a single combined filter
<img src="https://i.imgur.com/k2su0JE.png" width="400"/>
#### Derivative of Gaussian Filter
- We can think of the **"Sobel Operator"** as a derivative of Gaussian
#### Laplacian of Gaussian Filter
<img src="https://i.imgur.com/XicUdrL.jpg" width="500"/>
<img src="https://i.imgur.com/fdlmaVE.png" width="300"/>
---
### Edge Detection: Canny Edge Detector
- Goal: map the image from a 2D array of pixels to a set of line segments or contours: far fewer edge fragments that still capture the essence of the image
<img src="https://i.imgur.com/pY2DCIi.png" width="350"/>
- Main Idea: Look for strong gradients
---
#### Good Detection:
- Minimize the probability of false positives / negatives
#### Good Localization:
- Detected edges should be as close as possible to the true edges
#### Single Response:
- Return only one point for each true edge point
<img src="https://i.imgur.com/v8iXCqU.png" width="450"/>
#### Gradients -> Edges:
1. Smoothing: suppress noise
2. Edge enhancement: Filter for contrast
3. Edge Localization:
<img src="https://i.imgur.com/E7dHTNI.png" width="500"/>
- Scale: Effect of sigma
- Larger sigma: detects large-scale edges
- Smaller sigma: detects finer features
- Sensitivity: Compare to Thresholding
1. Choose a threshold t
2. Set pixels below t to 0
3. Set pixels greater than or equal to t to 1
<img src="https://i.imgur.com/OwlpCdv.png" width="200"/>
#### Canny Edge Detector
<img src="https://i.imgur.com/r8RQiqS.jpg" width="500"/>
- Non-Maximum Suppression
<img src="https://i.imgur.com/RaeApvg.png" width="500"/>
- Check if a pixel is a local maximum along the gradient direction; select the single maximum across the width of the edge.
- Problem: some edge pixels may not survive the thresholding
- Solution: Hysteresis Thresholding
- Maintain two thresholds: K_high, K_low
- Use K_high to find strong edges that start edge chains
- Use K_low to find weak edges that continue edge chains
- Typical ratio: K_high / K_low = 2
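In OpenCV the whole chain is a single call; a usage sketch (the filename and threshold values are placeholders, with K_high / K_low = 2 as above):

```python
import cv2

img = cv2.imread('input.png', cv2.IMREAD_GRAYSCALE)   # hypothetical input file
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)           # smoothing: suppress noise

k_low, k_high = 100, 200                               # hysteresis thresholds, ratio 2
edges = cv2.Canny(blurred, k_low, k_high)              # gradient + NMS + hysteresis
```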

---
### Structure Extraction
- Edges are useful signals that indicate occluding boundaries and shape.
- But sometimes the boundaries of interest are fragmented, and we have extra clutter.
#### Fitting
- We want to associate a model with observed features.
- The model could be a line, circle, car, ...
- EX: Line Fitting
- Difficulty of Line Fitting:
- Extra edge points (clutter)
- only some parts of the line are detected
- noise in orientation
<img src="https://i.imgur.com/VXrZyGT.png" width="200"/>
#### Hough Transform:
- A voting transform
- Main idea:
1. Vote for all possible lines on which each edge point could lie.
2. Look for line candidates that get many votes.
3. Noise features will get votes, but should be inconsistent.
- Finding Line in an Image: Hough Space
<img src="https://i.imgur.com/y4iNv0M.png" width="450"/>
<img src="https://i.imgur.com/7rVOUpx.png" width="450"/>
- Basic Idea for Voting:
- Pick bins that have the highest votes
<img src="https://i.imgur.com/8WUPaBV.png" width="450"/>
- Polar representation of a Line:
<img src="https://i.imgur.com/Sy3704C.png" width="300"/>
- Hough Transform Algorithm:
- Using Polar parameterization <img src="https://i.imgur.com/FLGjVc5.png" width="150"/>
<img src="https://i.imgur.com/ylK17xM.png" width="250"/>
- Algorithm:
<img src="https://i.imgur.com/Lv48jMz.png" width="500"/>
- Example:
<img src="https://i.imgur.com/iSk2blo.png" width="400"/>
- Impact of Noise:
<img src="https://i.imgur.com/Tak2wBm.png" width="380"/>
<img src="https://i.imgur.com/6GMoGlD.png" width="375"/>
- If we choose the bin size too small, we may miss some important components
- A larger bin size helps, but increases the chance of capturing noise
- Extension to improve:
<img src="https://i.imgur.com/LYjvgLu.jpg" width="500"/>
#### Hough Space for Circles:
<img src="https://i.imgur.com/83lpjlo.jpg" width="500"/>
<img src="https://i.imgur.com/j4o6BH2.png" width="500"/>
<img src="https://i.imgur.com/DsQuM0J.png" width="500"/>
- Algorithm for Circle:
<img src="https://i.imgur.com/kXdJl4N.png" width="350"/>
- Example
<img src="https://i.imgur.com/P1uK8O4.jpg" width="500"/>
##### Pros
- All points are processed independently, so can cope with occlusion
- Some robustness to noise: noise points unlikely to contribute consistently to any single bin
- Can detect multiple instances of a model in a single pass
##### Cons
- Complexity of search time increases exponentially with the number of model parameters
- Non-target shapes can produce spurious peaks in parameter space
- Quantization: hard to pick a good grid size
---
## Segmentation
- Segmentation as Clustering
- Graph-theoretic Segmentation
---
### Segmentation as Clustering
- Gestalt Theory
- Elements in a collection can have properties that result from relationships
- "The whole is greater than the sum of its parts"
- Relationships among parts can yield new properties/features
- Image Segmentation
- Goal: identify groups of pixels that go together
<img src="https://i.imgur.com/zOvOZt6.jpg" width="500"/>
---
### Segmentation and Grouping
- Example:
- Simple Image:
<img src="https://i.imgur.com/VIPr6WI.png" width="500"/>
- With Noise:
<img src="https://i.imgur.com/7rqxfE4.png" width="500"/>
- Solution:<img src="https://i.imgur.com/3ZshZh7.png" width="250"/>
- Goal: choose three “centers” as the representative intensities, and label every pixel according to which of these centers it is nearest to.
- Clustering:
<img src="https://i.imgur.com/cKj0a9B.png" width="500"/>
- If we knew the cluster centers, we could allocate points to groups by assigning each to its closest center.
- If we knew the group memberships, we could get the centers by computing the mean per group.
- **However, we don't know either of the two.**
- K-Means Clustering
- Algorithm:
1. Randomly initialize the cluster centers, c1, ..., cK
2. Given cluster centers, determine points in each cluster
– For each point p, find the closest ci. Put p into cluster i
3. Given points in each cluster, solve for ci
– Set ci to be the mean of points in cluster i
4. If ci have changed, repeat Step 2
- Does not always find the global minimum of the objective function.
- There is no guarantee on the quality of the solution. (A NumPy sketch of the algorithm appears at the end of this K-Means block.)
- Pros:
- Simple, fast to compute (Converge Quickly)
- Converges to local minimum of within-cluster squared error
- Cons:
- Setting k?
- Sensitive to initial centers
- Sensitive to outliers
- Detects spherical clusters only
- Assuming means can be computed
<img src="https://i.imgur.com/KbtJNlp.png" width="300"/>
- K-Means++
- Algorithm:
1. Randomly choose first center.
2. Pick new center with prob. proportional to <img src="https://i.imgur.com/nhGvBf1.png" width="100"/>
- (Contribution of p to total error)
3. Repeat until k centers.
- <img src="https://i.imgur.com/zvdcUuq.png" width="300"/>
- Keep the centers as far apart from each other as possible (a seeding sketch follows below)
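A sketch of the D(p)² seeding, which can replace the random initialization in the k-means sketch above:

```python
import numpy as np

def kmeans_pp_init(points, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]      # 1. first center: uniform
    while len(centers) < k:
        # Squared distance of every point to its nearest chosen center.
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        # 2. sample the next center with probability proportional to d2.
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centers)                           # 3. repeat until k centers
```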
- Feature Space:
- Intensity (1D)
- Color: RGB (3D)
- Texture: Filter bank responses (e.g. 24D, when we have 24 filters)
- **Assigning a cluster label per pixel may yield outliers**
<img src="https://i.imgur.com/SFMssA9.png" width="400"/>
- Grouping pixels based on **intensity+position** similarity
- Weighting problem: we need to decide whether **appearance** or **spatial position** matters more
---
### Probabilistic Clustering
- K-means doesn’t answer these questions
- What’s the probability that a point x is in cluster m?
- What’s the shape of each cluster?
- Treating the data:
- From a bunch of points
- To points generated by sampling a continuous function
- This function is called a generative model.
- One generative model is a **mixture of Gaussians (MoG)**
- Mixture of Gaussians (MoG):
<img src="https://i.imgur.com/0WVO6u8.png" width="300"/>
<img src="https://i.imgur.com/3KYkaXf.png" width="500"/>
- Expectation Maximization (EM):
- Goal: Find blob parameters µ that maximize the likelihood function
<img src="https://i.imgur.com/mrMr29X.png" width="200"/>
- Approach:
1. E-step: given current guess of blobs, compute ownership of each point
2. M-step: given ownership probabilities, update blobs to maximize likelihood function
3. Repeat until convergence
- Turns out this is useful for all sorts of problems. (A sketch using sklearn appears at the end of this section.)
- Pros
- **Probabilistic interpretation**
- **Soft assignments between data points and clusters**
- **Generative model, can predict novel data points**
- Relatively compact storage
- Cons
- Local minima
- k-means is NP-hard even with k=2
- Initialization
- **Often a good idea to start with some k-means iterations. (To speed up the process)**
- Need to know number of components
- Solutions: model selection (AIC, BIC), Dirichlet process mixture
- **Need to choose generative model**
- Numerical problems are often a nuisance
- MoG Color Models for Image Segmentation
- User assisted image segmentation
- User marks two regions for foreground and background.
- Learn a MoG model for the color values in each region.
- Use those models to classify all other pixels.
<img src="https://i.imgur.com/XHIcjKd.png" width="400"/>
---
### Model-Free Clustering: Mean-Shift Clustering
- Previous methods are sensitive to initialization
- An advanced and versatile technique for clustering-based segmentation
- Iterative Mode Search
1. Initialize random seed, and window W
2. Calculate center of gravity (the “mean”) of W: <img src="https://i.imgur.com/L38Cu54.png" width="100"/>
3. Shift the search window to the mean
4. Repeat Step 2 until convergence
<img src="https://i.imgur.com/pRjWOIC.png" width="500"/>
<img src="https://i.imgur.com/BKEpltt.png" width="500"/>
- **Real Modality Analysis:**
- Tessellate the space with windows
- Run the procedure in parallel
<img src="https://i.imgur.com/NCdD186.png" width="500"/>
- Cluster: all data points in the attraction basin of a mode
- Attraction basin: the region for which all trajectories lead to the same mode
- Merge windows that end up near the same “peak” or mode.
- **Problem: Computational Complexity**
- **Speedups: Basin of Attraction**
- Assign all points within radius r of end point to the mode.
<img src="https://i.imgur.com/p4KCWmR.png" width="500"/>
- Assign all points within radius r/c of the search path to the mode.
<img src="https://i.imgur.com/GntVY6d.png" width="500"/>
- Pros
- General, application-independent tool
- Model-free, does not assume any prior shape (spherical, elliptical, etc.) on data clusters
- Just a single parameter (window size h)
- h has a physical meaning (unlike k-means)
- Finds variable number of modes
- Robust to outliers
- Cons
- Output depends on window size
- Window size (bandwidth) selection is not trivial
- Computationally (relatively) expensive
- Does not scale well with dimension of feature space
---
### Segmentation as Energy Minimization
- Markov Random Fields
- Learn local effects, get global effects out
- In real images, regions are often homogenous; neighboring pixels usually have similar properties (intensity, color, texture, …)
- Markov Random Field (MRF) is a statistical model which captures such contextual constraints
- **Each node can be Observation or Unobserved**
- **Each node is some random variable**
- Connection: there is some relation between the random variables
- Given the observed y, compute the most likely hidden x
- **A function determines how the observed values relate to the true values**
<img src="https://i.imgur.com/bU0Jjbr.png" width="500"/>
- Example:
- How the observed black pixel corresponds to a true black pixel
<img src="https://i.imgur.com/68zdpB3.png" width="500"/>
#### Labelling
1. Extract features from the input image
- Each pixel in the image has a feature vector
- For the whole image, we have: <img src="https://i.imgur.com/70jpFc3.png" width="170"/>
2. Assign each pixel a label
- For the whole image, we have: <img src="https://i.imgur.com/LW2pG0a.png" width="80"/>
<img src="https://i.imgur.com/frMhQ0O.png" width="170"/>
- For an n×m image, there are <img src="https://i.imgur.com/ewzPJmK.png" width="50"/> possible labelings
- Define a probability measure on the set of all possible labelings and select the most likely one.
- P(ω | f) measures the probability of a labelling, given the observed feature f
- Our goal is to find an optimal labeling which maximize P(ω | f)
- For each pixel, we can define some surrounding pixels as its neighbors.
#### Network Joint Probability
<img src="https://i.imgur.com/5Tm0Y9H.png" width="400"/>
#### Energy formulation
- Joint probability <img src="https://i.imgur.com/emVnzn7.png" width="300"/>
- Maximizing the joint probability is the same as minimizing the negative logarithm of it. (Products become sums, which makes the extremum easier to compute.)
<img src="https://i.imgur.com/fxFIVz8.png" width="500"/>
- This is similar to free-energy problems in statistical mechanics (spin glass theory). We therefore draw the analogy and call E an **energy function**.
- <img src="https://i.imgur.com/ISWkbdT.png" width="230"/>
- **We want to minimize the Energy: Can be looked as giving penalty for each assignment.**
<img src="https://i.imgur.com/ao771sG.png" width="400"/>
- Single-node potentials (“unary potentials”)
- Encode local information about the given pixel/patch
- How likely is a pixel/patch to belong to a certain class
(e.g. foreground/background)?
- Pairwise potentials: Encode neighborhood information
- How different is a pixel/patch’s label from that of its neighbor?
(e.g. based on intensity/color/texture difference, edges)
- EX: Assign a penalty whenever there is a neighbor change.
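A sketch of evaluating such an energy for a given labeling, with a Potts pairwise term (a constant penalty per neighbor change); the array layout is our own convention:

```python
import numpy as np

def mrf_energy(labels, unary, w=1.0):
    """labels: (H, W) ints; unary[i, j, l]: cost of label l at pixel (i, j)."""
    h, wid = labels.shape
    e = unary[np.arange(h)[:, None], np.arange(wid)[None, :], labels].sum()
    # Pairwise (Potts) term: penalty w whenever 4-neighbors disagree.
    e += w * np.sum(labels[1:, :] != labels[:-1, :])
    e += w * np.sum(labels[:, 1:] != labels[:, :-1])
    return e
```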
---
### Graph cuts for Energy Minimization / image segmentation
- Basic idea
- s-t Mincut algorithm
- Extension to non-binary case
#### Graph Cuts for Optimal Boundary Detection
- Idea: convert MRF into a source-sink graph
<img src="https://i.imgur.com/byQzJaC.png" width="500"/>
- For each pair of points we assign a weight based on the difference in intensity between the two pixels
- We want a segmentation that is aligned to intensity boundaries
- We first define the foreground (blue) and background (red) labels.
- We want a cut such that the sum of the costs of all cut edges is minimal.
<img src="https://i.imgur.com/SgsZxFE.png" width="500"/>
- Now we map this problem to the Energy Minimization Problem.
##### Simple Example of Energy
<img src="https://i.imgur.com/iBmvkjE.png" width="500"/>
##### Adding Regional Properties
- Now we connect each pixel to the source and the sink: t-links
<img src="https://i.imgur.com/dF7TJRZ.png" width="500"/>
- Our cut will need to cut each node to either the source or the sink.
- In order to transfer this to the energy-minimization formulation, we need to map the unary terms to those t-links.
<img src="https://i.imgur.com/FqQuhDQ.png" width="350"/>
**Note: 𝜙(𝑡) is the cost for assigning the link to the s node!**
##### How to Set the Potentials? Some Examples
- Then we can assign the potential to the corresponding links of the graph.
- For Unary Potential:
- The likelihood of the pixel belonging to each of the different classes
<img src="https://i.imgur.com/sxi9flv.png" width="500"/>
- For Pair Wise Potential:
- Set so as to discourage label changes: we get few label changes
- Low penalty if neighboring labels are the same
<img src="https://i.imgur.com/XFi0vfN.png" width="500"/>
##### What Does the Code Look Like?
<img src="https://i.imgur.com/gPbYWPt.png" width="600"/>
#### Example: MRF for Image Segmentation
<img src="https://i.imgur.com/OZ4X41y.png" width="600"/>
#### How the Actual Max-Flow Problem Is Solved: S-T Mincut Algorithm
- What is an st-cut?
- An st-cut (S,T) divides the nodes between source and sink.
- What is the cost of a st-cut?
- Sum of cost of all edges going from S to T
<img src="https://i.imgur.com/dJiSb65.png" width="150"/>
- What is the st-mincut?
- st-cut with the minimum cost
1. First we define a graph
<img src="https://i.imgur.com/Pesh7FQ.png" width="600"/>
2. There is a duality between the s-t mincut and the maximum-flow problem, so we can compute the maximum flow between source and sink instead.
- Some constraints
<img src="https://i.imgur.com/nvQdHqW.png" width="300"/>
##### Max-Flow Algorithm
<img src="https://i.imgur.com/TiYr6wm.png" width="500"/>
<img src="https://i.imgur.com/T8mANP2.png" width="200"/>
<img src="https://i.imgur.com/y6JE04L.png" width="188"/>
<img src="https://i.imgur.com/4J4BkQf.png" width="200"/>
<img src="https://i.imgur.com/05gVnZG.png" width="200"/>
<img src="https://i.imgur.com/BAqkGJx.png" width="200"/>
<img src="https://i.imgur.com/ByA7SoE.png" width="200"/>
<img src="https://i.imgur.com/s2HiWIg.png" width="200"/>
<img src="https://i.imgur.com/MqlJNw1.png" width="177"/>
<img src="https://i.imgur.com/NaPMJSv.png" width="210"/>
##### When Can s-t Graph Cuts Be Applied?
<img src="https://i.imgur.com/v6tc7xs.png" width="500"/>
- The energy needs to prefer regions to have the same label.
- If the energy is not **submodular**, we can't apply **graph cuts**.
- If the problem is not binary, there is still a chance (see below).
#### Dealing with Non-Binary Cases
- We would like to solve also multi-label problems.
- The bad news: Problem is NP-hard with 3 or more labels!
- There exist some approximation algorithms which extend graph cuts to the multi-label case:
- alpha-Expansion
- alpha-beta-Swap
- They are no longer guaranteed to return the globally optimal result.
#### Alpha-Expansion Move
<img src="https://i.imgur.com/oyWK1Jz.png" width="400"/>
1. Set one label to alpha
2. Find the best cut that separates alpha from the other labels
---
### Applications
- Interactive segmentation
---
## Recognition & Categorization
- Object Recognition and Categorization
- Problem Definitions
- Challenges
- Sliding-Window Object Detection
- Detection via Classification
- Global Representations
- Classifier Construction
---
### Object Recognition:
#### Challenges (3D)
- Viewpoint changes
- Translation
- Image-plane rotation
- Scale changes
- Out-of-plane rotation
- Illumination
- Noise
- Clutter
- Occlusion
---
#### Appearance-Based Recognition
- Basic assumption
- Objects can be represented by a set of images (“appearances”).
- For recognition, it is sufficient to just compare the 2D appearances.
- No 3D model is needed.
- Represent an object as a set of images
##### Global Representation
- Represent each object (view) by a global descriptor.
<img src="https://i.imgur.com/wnRJlv1.png" width="500"/>
- For recognizing objects, just match the descriptors.
- Build up a database of many descriptors.
##### Appearance based Recognition
- Recognition as feature vector matching
<img src="https://i.imgur.com/ZyvdtyZ.png" width="500"/>
---
### Sliding-Window Object Detection
- Detection via Classification
- Global Representations
- Classifier Construction
---
#### Detection via Classification:
- Main Idea:
- Basic component: a binary classifier
- Yes / No, this is / isn't a car.
- If the object may be in a cluttered scene, slide a window around looking for it.
- Essentially, this is a brute-force approach with many local decisions.
<img src="https://i.imgur.com/IcM732v.png" width="400"/>
- What is a Sliding Window Approach? Search over space and scale
- We need to:
1. Obtain training data
2. Define features
3. Define classifier
- Feature Extraction: Global Appearance
- Pixel-based representations are sensitive to small shifts
<img src="https://i.imgur.com/r6fsS3u.png" width="500"/>
- Color or grayscale-based appearance description can be sensitive to illumination and intra-class appearance variation.
<img src="https://i.imgur.com/kEsh1mD.png" width="300"/>
- Gradient-based Representations
- Consider edges, contours, and (oriented) intensity gradients
<img src="https://i.imgur.com/uKoJFoy.png" width="500"/>
- Summarize local distribution of gradients with histograms
- Collect information from sub-window
- Locally orderless: offers invariance to small shifts and rotations
- Localized histograms offer more spatial information than a single global histogram (tradeoff invariant vs. discriminative)
- Contrast-normalization: try to correct for variable illumination
<img src="https://i.imgur.com/TdIdkN4.jpg" width="500"/>
- **Histograms of Oriented Gradients (HoG)**
<img src="https://i.imgur.com/svHVgrZ.png" width="500"/>
#### Classifier Construction
- Discriminative Methods
- Learn a decision rule (classifier) assigning image features to different classes
<img src="https://i.imgur.com/7amSFOm.jpg" width="500"/>
- We want to find the decision boundary
- Ways:
<img src="https://i.imgur.com/XN4ybyf.png" width="500"/>
### Classification with SVMs
- Discriminative classifier based on optimal separating hyperplane (i.e. line for 2D case)
- Maximize the **margin** between the positive and negative training examples
<img src="https://i.imgur.com/Ax6QKkL.png" width="300"/>
<img src="https://i.imgur.com/sbSqWL2.png" width="600"/>
- Finding the Maximum Margin Line
<img src="https://i.imgur.com/m5rCj9g.png" width="300"/>
<img src="https://i.imgur.com/q8H76Tx.png" width="600"/>
#### Extension: Non-Linear SVMs
<img src="https://i.imgur.com/8Yu2B95.png" width="600"/>
<img src="https://i.imgur.com/uIr3CoV.png" width="600"/>
##### Some Often-Used Kernel Functions
<img src="https://i.imgur.com/U5LCShS.png" width="400"/>
<img src="https://i.imgur.com/UjCn28Y.png" width="500"/>
#### HOG Detector
- HOG Descriptor Processing Chain
<img src="https://i.imgur.com/iwmkerU.png" width="500"/>
### Boosting
- Designed for real-time face detection
<img src="https://i.imgur.com/WPaMAzY.png" width="500"/>
<img src="https://i.imgur.com/CeqaOxK.png" width="400"/>
<img src="https://i.imgur.com/NitSFlF.png" width="500"/>
#### Example: Face Detection
- Frontal faces are a good example of a class where global appearance models + a sliding window detection approach fit well:
- Regular 2D structure
- Center of face almost shaped like a “patch”/window
- **Feature extraction:** Integral Image
<img src="https://i.imgur.com/l5K0uIx.png" width="500"/>
- With the integral image defined, the features for an image are built from small white and gray rectangles; these are called rectangle features.
- The rectangle features vary in size, but a white rectangle always covers the same area as its gray counterpart.
- These rectangles slide over the image like filters. Each feature value is the sum inside the white rectangle(s) minus the sum inside the gray rectangle(s), both computed from the integral image. (A sketch follows below.)
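The sketch referenced above: the integral image is two cumulative sums, and any box sum then takes only four lookups (the helper names are our own):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of all pixels above and to the left of (y, x), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum over an inclusive rectangle using the four-corner trick."""
    s = ii[bottom, right]
    if top > 0:
        s -= ii[top - 1, right]
    if left > 0:
        s -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        s += ii[top - 1, left - 1]
    return s

# A two-rectangle feature = box_sum(white region) - box_sum(gray region).
```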
- Large Library of Features
<img src="https://i.imgur.com/3yanx3J.png" width="500"/>
- **AdaBoost for Feature+Classifier Selection**
<img src="https://i.imgur.com/DakKwpI.png" width="500"/>
- **Cascading Classifiers for Detection**
- For efficiency, apply less accurate but faster classifiers first to immediately discard windows that clearly appear to be negative
- The idea of a cascade classifier: the features are grouped into several stage classifiers. The first stage has the lowest accuracy but can quickly reject a large fraction of non-face windows; the next stage handles the harder remaining cases and rejects fewer windows than the first; and so on, until the last stage. The windows that survive all stages are reported as faces.

- Filter for promising regions with an initial inexpensive classifier
- Build a chain of classifiers, choosing cheap ones with low false negative rates early in the chain
- Viola-Jones Face Detector
<img src="https://i.imgur.com/M64qR8L.png" width="500"/>
#### Limitations: Changing Aspect Ratios
- Sliding window requires fixed window size
- Aspect ratios?
- Fixed window size
- Wastes training dimensions
- Adapted window size
- Difficult to share features
- “Squashed” views [Dalal&Triggs]
- Need to squash test image, too
- Not all objects are “box” shaped
---
## Local Features & Matching
- Local Features – Detection and Description
- Recognition with Local Features
### Local Features – Detection and Description
- Motivation:
- Global representations have major limitations
- Instead, **describe and match only local regions**
- Application: Image Stitching
- Detect feature points in both images
- Find corresponding pairs
<img src="https://i.imgur.com/wRyEcQt.jpg" width="500"/>
- Use these pairs to align the images
<img src="https://i.imgur.com/J8tEO1h.png" width="500"/>
- General Approach:
<img src="https://i.imgur.com/RrCBq30.png" width="500"/>
- Problems:
<img src="https://i.imgur.com/Rl8KS13.png" width="470"/>
<img src="https://i.imgur.com/QVNqHBp.png" width="500"/>
#### Requirements
- Region extraction needs to be **repeatable** and **accurate**
- **Invariant** to translation, rotation, scale changes
- **Robust or covariant** to out-of-plane (affine) transformations
- **Robust** to lighting variations, noise, blur, quantization
- **Locality**: Features are local, therefore robust to occlusion and clutter.
- **Quantity**: We need a sufficient number of regions to cover the object.
- **Distinctiveness**: The regions should contain “interesting” structure.
- **Efficiency**: Close to real-time performance.
### Keypoint Localization
- Given two versions of the same image, we want to find the same keypoints
- Keypoint detection: find interesting keypoints in the image (edges, corners, or blobs).
- Feature extraction: extract features describing the region around each keypoint (these are called local features).
- We want to find keypoints **independently**
- [圖像特徵比對(一)-取得影像的特徵點](https://chtseng.wordpress.com/2017/05/06/%E5%9C%96%E5%83%8F%E7%89%B9%E5%BE%B5%E6%AF%94%E5%B0%8D%E4%B8%80-%E5%8F%96%E5%BE%97%E5%BD%B1%E5%83%8F%E7%9A%84%E7%89%B9%E5%BE%B5%E9%BB%9E/)
<img src="https://i.imgur.com/l6T1nm8.png" width="500"/>
- Goals:
- Repeatable detection
- Precise localization
- Interesting content
- **Look for two-dimensional signal changes**
- **Finding Corners**: In the region around a corner, image gradient has two or more dominant directions
- Corners are repeatable and distinctive
<img src="https://i.imgur.com/56Y3bNn.png" width="500"/>
<img src="https://i.imgur.com/ObfZv8l.png" width="500"/>
- Design criteria
- We should easily recognize the point by looking through a **small window (locality)**
- Shifting the window in **any direction** should give a large change in intensity **(good localization)**
#### Harris Detector Formulation
- Finding corners
- Compute the Second-Moment Matrix
- Harris corner detection decides whether a corner is present by measuring how much the window content changes when shifted in each direction.
<img src="https://i.imgur.com/9EMC6TM.png" width="500"/>
<img src="https://i.imgur.cm/NoTftLt.png" width="500"/>
>$\mathbf{A} \mathbf{v} = \lambda \mathbf{v}$
> where λ is a scalar called the eigenvalue corresponding to v, and v is the eigenvector corresponding to λ. In other words, applying the linear transformation A to an eigenvector only stretches or shrinks it; its direction is unchanged.
- Classification of Eigenvalues
<img src="https://i.imgur.com/hfxRX8C.png" width="500"/>
- Corner Response Function
- Gives a larger value when both eigenvalues are large
- Gives a smaller value for edges
<img src="https://i.imgur.com/db0zrQL.png" width="500"/>
#### Window Function
- Window Function w(x,y)
- Option 1: uniform window
- Sum over square window
- Option 2: Smooth with Gaussian
- Gaussian already performs weighted sum
- Result is rotation invariant
#### Summary: Harris Detector
<img src="https://i.imgur.com/tAoEdLN.png" width="500"/>
- Rotation invariance?
<img src="https://i.imgur.com/8YGbHcf.png" width="500"/>
- Scale invariance?
<img src="https://i.imgur.com/VbaLVVM.png" width="500"/>
#### Hessian Detector
- We first build the matrix of second derivatives (the Hessian)
<img src="https://i.imgur.com/12arKBe.png" width="500"/>
- **Intuition: Search for strong derivatives in two orthogonal directions**
<img src="https://i.imgur.com/FomcnAj.png" width="500"/>
- Effect: Responses mainly on corners and strongly textured areas.
<img src="https://i.imgur.com/eb2cdOm.png" width="400"/>
#### From Points to Regions...
- The Harris and Hessian operators define interest points.
- Precise localization
- High repeatability
- **In order to compare those points, we need to compute a descriptor over a region.**
<img src="https://i.imgur.com/X2WI89m.png" width="300"/>
##### Naïve Approach: Exhaustive Search
- Comparing descriptors while varying the patch size
<img src="https://i.imgur.com/nLZEv0A.png" width="500"/>
##### Automatic Scale Selection
- Design a signature function on the region that is “scale invariant” (the same for corresponding regions, even if they are at different scales)
- For a point in one image, we can consider it as a function of region size (patch width)
- Common approach:
- Take a local maximum of this function.
- Observation: region size for which the maximum is achieved should be invariant to image scale.
<img src="https://i.imgur.com/o832xf3.png" width="500"/>
<img src="https://i.imgur.com/6lCEZIm.png" width="500"/>
<img src="https://i.imgur.com/BlCVphA.jpg" width="500"/>
- Normalize: Rescale to fixed size
<img src="https://i.imgur.com/CILqF78.png" width="250"/>
- What Is A Useful Signature Function?
- Laplacian-of-Gaussian = “blob” detector
- We find points that are local extrema compared with the adjacent scales.
- We can use Difference of Gaussians instead of Laplacian.
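A sketch of the signature function: the scale-normalized LoG response at one point, evaluated over a range of scales; the scale at which the magnitude peaks is the characteristic scale (recomputing the full filter per scale is wasteful, but keeps the sketch short):

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def log_signature(img, y, x, sigmas):
    # sigma**2 normalizes the LoG so responses are comparable across scales.
    return np.array([s ** 2 * gaussian_laplace(img, s)[y, x] for s in sigmas])

# Example: sigmas = np.geomspace(1.0, 16.0, num=20); pick argmax of |signature|.
```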
#### Harris-Laplace
1. Initialization: Multiscale Harris corner detection
2. Scale selection based on Laplacian
(same procedure with Hessian => Hessian-Laplace)
<img src="https://i.imgur.com/U3QySJ3.jpg" width="500"/>
#### Summary: Scale Invariant Detection
- Given: Two images of the same scene with a large scale difference between them.
- Goal: Find the same interest points independently in each image.
- Solution: Search for maxima of suitable functions in scale and in space (over the image).
- Two strategies
- Laplacian-of-Gaussian(LoG)
- Difference-of-Gaussian (DoG) as a fast approximation
---
### Recognition with Local Features: Local Descriptors

- SIFT
- Applications
---
#### SIFT (Scale Invariant Feature Transform)
- Descriptor computation:
- Divide patch into 4x4 sub-patches: 16 cells
- Compute histogram of gradient orientations (8 reference angles) for all pixels inside each sub-patch
- Resulting descriptor: 4x4x8=128 dimensions
<img src="https://i.imgur.com/sNEpRjH.png" width="500"/>
- Rotation Invariant Descriptors
- Find local orientation
- Dominant direction of gradient for the image patch
- Orientation Normalization: Computation
- Compute orientation histogram
- Select dominant orientation
- Normalize: rotate to fixed orientation
<img src="https://i.imgur.com/uvg03gN.png" width="500"/>
<img src="https://i.imgur.com/0DG6n7m.png" width="500"/>
#### Summary: SIFT
<img src="https://i.imgur.com/FuXWxY3.jpg" width="500"/>
#### Local Descriptors: SURF
- Fast approximation of SIFT idea
- Many of SIFT's steps are approximated to achieve a speedup
- Efficient computation by 2D box filters & integral images
- 6 times faster than SIFT
- Equivalent quality for object identification
#### Applications of Local Invariant Features
- Wide-Baseline Stereo
<img src="https://i.imgur.com/PjXVAeN.jpg" width="500"/>
- Panorama Stitching
<img src="https://i.imgur.com/aavh3Og.jpg" width="500"/>
---
### Recognition with Local Features
- Matching local features
- Finding consistent configurations
- Alignment: linear transformations
- Affine estimation
- Homography estimation
---
Image content is transformed into local features that are invariant to translation, rotation, and scale
- Goal: Verify if they belong to a consistent configuration
<img src="https://i.imgur.com/6G9IcTE.png" width="500"/>
#### Warping vs. Alignment
<img src="https://i.imgur.com/n00P2sm.png" width="500"/>
#### Parametric (Global) Warping
<img src="https://i.imgur.com/zc7c73B.png" width="500"/>
<img src="https://i.imgur.com/d1zMPcy.png" width="500"/>
<img src="https://i.imgur.com/eTLrmEN.png" width="500"/>
<img src="https://i.imgur.com/prap2Ea.png" width="500"/>
- **Translation is not a linear operation in 2D**
- **Only linear 2D transformations can be represented with a 2x2 matrix.**
- How can we represent translation as a 3x3 matrix using homogeneous coordinates?
- A: Using the rightmost column:
<img src="https://i.imgur.com/mKUIPbL.png" width="300"/>
<img src="https://i.imgur.com/clO8mXA.png" width="500"/>
#### 2D Affine Transformations
- Affine transformations are combinations of ...
- Linear transformations
- Translations
- Parallel lines remain parallel
<img src="https://i.imgur.com/GMlcoIh.png" width="300"/>
#### Projective Transformations
- Affine transformations
- Projective warps
- Parallel lines do not necessarily remain parallel
<img src="https://i.imgur.com/rYrbZBX.png" width="300"/>
<img src="https://i.imgur.com/nxNGyui.png" width="500"/>
#### Fitting an Affine Transformation
<img src="https://i.imgur.com/dPtA4IG.png" width="500"/>
<img src="https://i.imgur.com/qqD1076.png" width="500"/>
#### Homography
- A projective transformation that maps one view of a plane to another view of the same plane
<img src="https://i.imgur.com/ylUfzAE.png" width="400"/>
<img src="https://i.imgur.com/kFEY2Sg.png" width="400"/>
<img src="https://i.imgur.com/bZhXQag.png" width="500"/>
- We must Normalize the image coordinates.
<img src="https://i.imgur.com/2EtVSoU.png" width="500"/>
<img src="https://i.imgur.com/Pp8AT7w.png" width="500"/>
<img src="https://i.imgur.com/4DduKTA.png" width="500"/>
<img src="https://i.imgur.com/NzdBxyW.png" width="500"/>
<img src="https://i.imgur.com/5sQjc4W.jpg" width="500"/>