Epipolar Geometry

Epipolar Geometry === <style> figure { padding: 4px; margin: auto; text-align: center; } figcaption { background-color: black; color: white; font-style: italic; padding: 1px; text-align: center; } </style> Previously, we explored the computation of a camera's intrinsic and extrinsic parameters through one or more views using standard camera calibration or single-view metrology procedures. These methods allowed us to extract information about the 3D world from a single image. However, ==it is generally impractical to reconstruct the complete structure of the 3D world solely from a single image==. This limitation arises from the intrinsic ambiguity inherent in mapping 3D to 2D: certain details are inevitably lost. For example, the following picture shows that a man holding up the Leaning Tower of Pisa can result in ambiguous scenarios. ![image](https://hackmd.io/_uploads/S1aEYVywp.png) The ambiguiy can be solved by looking into the same 3D scene in multiple views. If the number of view is 2, the forming geometry is called Epipolar Geometry. ## What is epipolar geometry? ### Definition & Terms :::success The geometry that relates the stereo camera poses, points in 3D, and corresponding points projected in 2D plane. ::: <figure style="text-align: center"> <img src="https://hackmd.io/_uploads/SkjETXqw6.png =500x" /> </figure> ([img src](https://www.cs.cmu.edu/~16385/s15/lectures/Lecture18.pdf)) - **Baseline**: camera centers are located at $o$ and $o'$, and the line between them is referred to as the baseline. - **Epipoles**: The locations of where the baseline intersects the two images. - **Epipolar plane**: the plane defined by the two camera centers and $p$. - **Epipolar lines**: intersection of the epipolar plane and the two image planes. ### Goal :::info 1. **Solve correspondence problem**: find corresponding feature points within an image pair. - By essential matrix: `known` camera parameters - By fundamental matrix: `unknown` camera parameters 2. **Decompose** camera pose $R, t$ from essential matrix. 3. **Triangulation**: recover 3D structure in the world from stereo 2D images. ::: ## Goal 1: Solve correspondence problem ![image](https://hackmd.io/_uploads/HkEJKHkPp.png) Given a point $p_{l}$ of the image of the left camera, how can we find the corresponding point $p_{r}$ efficiently on the right image? One intuitive approach would be taking a patch around $p_{l}$, scanning through entire right image to compare all patches using similarity measurement. Obviously, it is inefficient. How to imrpove? **Epipolar constaint** comes to rescue. ### Epipolar constraint :::success Epipolar constraint reduces search space to a single line. ::: ![image](https://hackmd.io/_uploads/BkhmSdyPT.png) Intuition behind epipolar constaint is simple. Let's see left image first. Given a point $x$ on the left image, the candidates of reconstructed 3D point can be $p_{1}, p_{2}, p_{3}, ...$, because we cannot tell the depth from only one image. Let's consider stereo image pairs together. By knowing the camera centers $o, o'$ of left and right images, respectively, ==all potential matches for $x$ must lie on a single line $l'$ (**epipolar line**)==. The fact we call it **epipolar constraint**. With the epipolar constraint, we can search the corresponding feature point $x'$ for $x$ on a line instead of an entire image. ### Essential Matrix 1. **Fact 1: Camera transition** We can describe the transition of camera pose using **rotation matrix $R$** and **translation vector $t$**. So do the image pairs, and we got $x' = R(x-t)$. <figure style="border: none"> <img src="https://hackmd.io/_uploads/rys0pcJD6.png" width="300"> </figure> 2. **Fact 2: Coplanarity** We know that $\vec{a} \times \vec{b} = 0$ if $\vec{a}$ and $\vec{b}$ is perpendicular. Therefore, we can get the result $(x-t)^{T}(t \times x) = 0$ as the following pucture illustrates: ![image](https://hackmd.io/_uploads/B1tC0ckva.png) Combining fact 1 and fact 2, we can obtain the essential matrix from the following expressions: $$ \begin{align*} (x-t)^{T}(t \times x) = 0 \\ (R^{-1}x')^{T}(t \times x) = 0 \\ (R^{T}x')^{T}(t \times x) = 0 \\ (x'^{T}R^)(t \times x) = 0 \\ x'^{T}(R [t_{\times}]) x = 0 \\ x'^{T}E x = 0 \end{align*} $$ The matrix ==$E = R[t_{\times}]$== is defined as the Essential Matrix. The term is introduced by Longuet-Higgins 1981. :::warning We may derive the result $E = [t_{\times}]R$ if starts from $x^{T}E x' = 0$ ::: :::info Cross product denotation: $$ a \times b = \begin{bmatrix} 0 & -a_{z} & a_{y} \\ a_{z} & 0 & -a_{x} \\ -a_{y} & a_{x} & 0 \end{bmatrix} \begin{bmatrix} b_{x} \\ b_{y} \\ b_{z} \end{bmatrix} = [a_{\times}]b $$ ::: ### Relation between epipolar constraint and essential matrix :::success Given a point in the image, multiplying by the **essential matrix** can tell us the **epipolar line** in the second view. ::: Recall [epipolar constraint](https://hackmd.io/emuCyxF8QQiGdeZkvmpj3g?both#Epipolar-constraint) just mentioned. Corresponding matching point must appear on the epipolar line, which can be derived by multiplying an essential matrix. The reason is explained below. We know that the distance of a point to a line would be 0 ([proof](https://en.wikipedia.org/wiki/Distance_from_a_point_to_a_line)), thus obtain $x^{T}l = x'^{T}l' = 0$. Given the following facts: 1. $x'^{T}E x = 0$ 2. $x'^{T}l' = 0$ 3. $Ex$ and $l'$ are coplanar We can get the relationship between the essential matrix and the epipolar line: $$ Ex = l' $$ Similarly, we can get $$ E^{T}x' = l $$ ### Essential Matrix --> Fundamental Matrix :::warning The **Fundamental matrix** is a **generalization** of the **Essential matrix**, where the assumption of **calibrated cameras** is removed. ::: Until now, our assumption is based on knowing the camera intrinsic parameters. However, we should aim for a more general expression that accommodates situations where the camera parameters are unknown. The transformation between camera point and image point is defined as follow: $$ \begin{align*} \hat{x} = K^{-1} x \end{align*} $$ where $\hat{x}$ is camera points, $x$ is image point, and $K^{-1}$ is camera intrinsic matrix. By substituting the above formula into Longuet-Higgins equation: $$ \begin{align*} \hat{x'}^{T} E \hat{x} = 0 \\ x'^{T} K'^{-T} E K^{-1} x = 0 \\ x'^{T} K'^{-T} R[t_{\times}] K^{-1} x = 0 \\ x'^{T} F x = 0 \end{align*} $$ The matrix ==$F=K'^{-T} E K^{-1}$== is defined as the Fundamental Matrix, which acts similar to the Essential matrix from the previous section but also encodes information about the camera matrices $K, K'$. :::warning :closed_book: A brief summary: Fundamental Matrix --> epipolar line --> solve correspondense problem efficiently :boom: Amazingly, without knowing the actual position of $X$ in 3D space, or any of the extrinsic or intrinsic characteristics of the cameras, we can establish a relationship between any $x$ and $x'$ just using the concept of epipolar geometry. ::: ### How to solve the Fundamental matrix? :::success Algorithms candidates: 1. Linear(Non-iterative): 7-pt, 8-pt, normalized 8-pt 2. Non-linear(Iterative): Minimize epipolar error 3. RANSAC with 7-pt ::: #### Approach 1: (Linear) The Eight-Point Algorithm As the title suggests, the Eight-Point Algorithm assumes that a set of at least 8 pairs of corresponding points between two images is available. 1. Given correspondense $x_{i} = (u_{i}, v_{i}, 1)$ and $x'_{i} = (u'_{i}, v'_{i}, 1)$, then substitute these two points in epipolar constraint: $$ \begin{bmatrix} u'_{i}, v'_{i}, 1 \end{bmatrix} \begin{bmatrix} F_{11} & F_{12} & F_{13} \\ F_{21} & F_{22} & F_{23} \\ F_{31} & F_{32} & F_{33} \end{bmatrix} \begin{bmatrix} u_{i} \\ v_{i} \\ 1 \end{bmatrix} = 0 $$ 2. Then reformulate the epipolar constraint as follow: <figure style="border: none"> <img src="https://hackmd.io/_uploads/rJODI01va.png" width="500"> </figure> 3. Since we know the Fundamental matrix is defined [up to scale](https://stackoverflow.com/questions/17114880/up-to-a-scale-factor), we require eight of these constraints to determine the Fundamental matrix: <figure style="border: none"> <img src="https://hackmd.io/_uploads/ByvN-SlP6.png" width="500"> </figure> 4. Fundamental matrix $F$ can be derived by solving the [least square problem](https://hackmd.io/V0CbttFVTlGdfd2Feb2w5g?view#Homogeneous-Systems-of-Equations) with SVD. 5. However, we know that the true Fundamental matrix has rank 2 (Because $F^{T}e=0$). Therefore, we should look for the best rank-2 approximation of $\hat{F}$ of $F$. This problem is solved again by SVD, where $\hat{F} = U \Sigma V^{T}$, then the best rank-2 approximation can be found by <figure style="border: none"> <img src="https://hackmd.io/_uploads/HkR5rHxva.png" width="350"> </figure> <figure> <img src="https://hackmd.io/_uploads/SyLCl7iUT.png" width="350"> <figcaption> Visualization of singular(rank-2) matrix </figcaption> </figure> #### Approach 2: (Linear) The normalized Eight-Point Algorithm Here I will go through the algo steps only. For readers who are curious about details why normalized version is more stable, please refer to [paper1](https://www.cse.unr.edu/~bebis/CS485/Handouts/hartley.pdf) or [paper2](https://sites.ecse.rpi.edu/~qji/CV/8point.pdf). Algo Steps: 1. Normalize feature points with transformation matrix $T$ and $T'$. <figure style="text-align: center"> <img src="https://hackmd.io/_uploads/rJEeAWcD6.png" width=500 /> </figure> 2. Do regular Eight-Point Algorithm on normalized feature points to find $\hat{F}$. 3. Enforce the rank-2 constraint. 3. De-normalize $\hat{F}$ to original coordinate space by $F = T'\hat{F}T$. <br> (img [src](https://www.uio.no/studier/emner/matnat/its/nedlagte-emner/UNIK4690/v16/forelesninger/lecture_7_1-epipolar-geometry.pdf)) <figure style="text-align: center"> <img src="https://hackmd.io/_uploads/SJiBuzcwp.png" width=300 /> </figure> #### Approach 3: Non-linear Least Square Nonlinear approach aims to minimize sum of squared geometric distances by iterative methods like [Levenberg-Marquardt](https://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm): $$ \sum_{i}[dist(x_{i}', Fx_{i})^{2} + dist(x_{i}, F^{T}x_{i}')^{2}] $$ <figure style="text-align: center"> <img src="https://hackmd.io/_uploads/S1Jg4X5PT.png" width=500 /> </figure> #### Approach 4: seven-point algorithm Recall: [why does fundamental matrix have 7 degrees of freedom?](https://stackoverflow.com/questions/49763903/why-does-fundamental-matrix-have-7-degrees-of-freedom) ([blog](https://imkaywu.github.io/blog/2017/06/fundamental-matrix/))([slides](https://www.uio.no/studier/emner/matnat/its/nedlagte-emner/UNIK4690/v16/forelesninger/lecture_7_1-epipolar-geometry.pdf)) ==TODO== ### Summary One of beautiful parts of essential or fundamental matrix is that it enables to find a corresponding feature point without reconstructing 3D points. Although we did **imagine** an intersection point is there, we did not actually compute it. Of course, if you want, epipolar geometry is able to reconstruct 3D points. That's what "triangulation" technique does for us. Let's move on! ## Goal 2: Decompose $R, t$ from essential matrix We can recover rotation matrix $R$ and translation vector $t$ by decomposing it with SVD trick. (Check this [slides](https://inst.eecs.berkeley.edu/~ee290t/fa19/lectures/lecture10-3-decomposing-F-matrix-into-Rotation-and-Translation.pdf) or this [video](https://youtu.be/VADaMWyUZ7M?list=PLFI1Cd4723_SQ-h8W0kT3igOoUCYjWRHG&t=1533) for details) $$ \begin{align} E &= [t_{\times}]R = U \Sigma V^{T}, \ U = [u_{1}, u_{2}, u_{3}] \\ t_{\times} &= u_{3} \\ R &= UYV^{T} \end{align} $$ Although there are four possible solutions for $R, t$ by decomposing essential matrix, only one solution where points is in front of both cameras(left top). That's what we want! ![image](https://hackmd.io/_uploads/H1pZOM6Dp.png) ([image source](https://inst.eecs.berkeley.edu/~ee290t/fa19/lectures/lecture10-3-decomposing-F-matrix-into-Rotation-and-Translation.pdf)) $R, t$ decomposition is an important step for camera pose estimation. ## Goal 3: Triangulation :::success - Essential Matrix - Given: $p, p', K, K'$ - Solve: $P, R, T$ - Fundamental Matrix - Given: $p, p'$ - Solve: $P, R, T, K, K'$ - Triangulation - Given: $p, p', K, K', R, T$ - Solve: $P$ ::: This is a huge topic. I open a new post to discuss it. [(Triangulation)](/mrBvdst4SyiRvc68LK5qnA) ## Supplement - $E^{T}e = Ee' =0$ <br> Because for any point $x$ (other than $e$) in the image of camera 1, the corresponding epipolar line in the image of camera 2, $l' = E^{T}x$, contains the epipole $e'$. Thus $e'$ satisfies $e'l' = e'^{T}(E^{T}x)=(e'^{T}E^{T})x = 0$ for all the $x$, so $Ee' = 0$. Similarly $E^{T}e = 0$. - What is the relationship of [homography](https://hackmd.io/@jackyyeh/BJxMZtUUT/%2FUOW4nBmxT5CTILq3nZoNzw) and the fundamental matrix? ([link](https://www.quora.com/Computer-Vision-What-is-the-relationship-of-homography-and-the-fundamental-matrix)) - **Fundamental matrix** maps a point to a line: $l' = Fx$ - **Fundamental matrix** is a 3 x 3 matrix of rank2. - **Homography matrix** maps a point to a point: $x'=Hx$ - **Homography matrix** is a 3 x 3 matrix of rank3. - The main difference lies in the ==assumption of planarity==. Homography assumes a planar scene, where the transformation between images is a ==planar projective transformation==. On the other hand, the fundamental matrix is used in non-planar scenes with multiple viewpoints. :::success Planar projective transformation: homography is a 3x3 full-rank matrix. A plane in 3D after transformation is still a plane. ::: <figure style="text-align: center"> <img src="https://hackmd.io/_uploads/S15NDmcXR.png" width="350"> </figure> - In some cases, when dealing with rectified stereo images (images that have been rectified to make corresponding epipolar lines parallel), the relationship between the homography and the fundamental matrix simplifies. In this scenario, the homography can be used to compute the fundamental matrix and vice versa. ## References - https://web.stanford.edu/class/cs231a/course_notes/03-epipolar-geometry.pdf - https://www.cs.cmu.edu/~16385/s15/lectures/Lecture18.pdf - https://www.uio.no/studier/emner/matnat/its/nedlagte-emner/UNIK4690/v16/forelesninger/lecture_7_1-epipolar-geometry.pdf - https://github.com/polygon-software/python-visual-odometry/blob/master/Chapter%209%20-%20Structure%20from%20Motion.ipynb