Structure from Motion

# Structure from Motion <style> figure { /* border: 1px #cccccc solid; */ padding: 4px; margin: auto; text-align: center; } figcaption { background-color: black; color: white; font-style: italic; padding: 1px; text-align: center; } </style> Right now we go beyond geometry of two cameras to multiple cameras. With observed points from multiple views, we will be able to simultaneously determine both **3D structure** of the scene and **camera motion** simultaneously. This problem is framed as structure from motion. ## Problem Definition :::success Given: 1. $m$ images of $n$ fixed 3D points 2. $mn$ 3D -> 2D correspondences $$ x_{ij} = M_{i}X_{j} ~~~~~~~~~~~~~~~ i=1 \ldots m; ~~ j = 1 \ldots n $$ Estimate: 1. $m$ projection matrices $M_{i}$ 3. $n$ 3D points $X_{j}$ ::: <figure> <img src="https://hackmd.io/_uploads/ry12OF496.png" width="400"> </figure> - **3D structure** = 3D points $X_{j}$ - **Camera motion** = projection matrices $M_{i}$ Note that **camera motion** stands for $M=K[R ~~~ t]$, describing how 3D points project to 2D plane by both intrinsic and extrinsic matrics. While **camera pose** stands for extrinsic matric $[R ~~~ t]$, is a part of camera motion, refering to 3D coordinate transformation from world system to camera system. Don't be confused! ## Types of SFM Algo * **Approach 1: Two-view / Fundamental matrices-based SFM** The basic idea behind two-View SFM is to estimate camera poses through determination of a fundamental matrix between correspondences in the image pair. Subsequently, the 3D structure is inferred by triangulating these correspondences. * **Approach 2: Factorization-based SFM** Factorization method decomposes a set of observed image points into the product of two separate factors: $$ \underbrace{D}_\text{observed points} = \underbrace{M}_\text{motion}~ \underbrace{S}_\text{structure} $$ * **Approach 3: Bundle adjustment for SFM** Bundle adjustment estimates the parameters of camera poses and 3D scenes altogether with a non-linear optimization. ## Types of Reconstruction Ambiguity For SFM problem, we care about the reconstruction ambiguity. It represents how much the reconstructed 3D structure is close to the real structure. | Name | Matrix | #DoF | Preserves | Shape | | -------- | -------- | -------- |--- |--- | | translation | $\begin{bmatrix} I & t \\ 0^{T} & 1 \end{bmatrix}_{3 \times 4}$ | 3 | orientation + ... | <img src="https://hackmd.io/_uploads/SyFMEh55a.png" width="50"> | | rigid / euclidean |$\begin{bmatrix} R & t \\ 0^{T} & 1 \end{bmatrix}_{3 \times 4}$ | 6 | lengths + ... | <img src="https://hackmd.io/_uploads/B1cQEh556.png" width="50"> | similarity | $\begin{bmatrix} sR & t \\ 0^{T} & 1 \end{bmatrix}_{3 \times 4}$ | 7 | angles + ... | <img src="https://hackmd.io/_uploads/B1cQEh556.png" width="60"> | | affine | $\begin{bmatrix} A & t \\ 0^{T} & 1 \end{bmatrix}_{3 \times 4}$ | 12 | parallelism + ... | <img src="https://hackmd.io/_uploads/HJbSV2956.png" width="60"> | | perspective | $\begin{bmatrix} A & t \\ v^{T} & 1 \end{bmatrix}_{3 \times 4}$ | 15 | straight lines + ... | <img src="https://hackmd.io/_uploads/ByurE3ccp.png" width="60"> | > If you are not familiar with image transformation types mentiond above, please refer to [image transformation post](https://hackmd.io/U5Mg1QQMSb-DU3gCzxhn-Q) for details. Without constraints imposed on the camera calibration matrix or on the scene, a projective reconstruction is obtained with a projective ambiguity. Require additional constraints for upgrading the reconstruction to affine, similarity, or Euclidean to reduce construction ambiguity. Alright. So far we have an overview of what SFM is, the approach to solve SFM, and potential ambiguity of reconstruction. Let's dive in each approach in detail and discuss their corresponding reconstruction ambiguity. ## Two-view / Fundamental matrices-based SFM As title states, this method is specifically designed for a two-view scenario. But this method can be extended to multi-view SFM problem by iteratively computing every two image pairs. <figure> <img src="https://hackmd.io/_uploads/S1iksR8Fp.png" width=400 /> </figure> ### Algo Two-view SFM basically consists of [essential matrix estimation](https://hackmd.io/@jackyyeh/BJxMZtUUT/%2FemuCyxF8QQiGdeZkvmpj3g) + [triangulation](https://hackmd.io/@jackyyeh/BJxMZtUUT/%2FmrBvdst4SyiRvc68LK5qnA). Typically, the camera parameters are assumed to be known. The algorithm steps are listed below: :::info **Steps** 1. Solve Fundamental matrix $F$ (e.g. normalized 8-point algo) 2. Obtain Essential matrix $E=K'^{T}FK$ 3. Decompose camera poses$(R, t)$ from Essential matrix 4. Triangulation for 3D structure ::: ### Reconstruction Ambiguity For calibrated cameras, the only ambiguity present is the similarity ambiguity. This arises from the assumption that the coordinate system of the first camera is equivalent to the world coordinate system. Alternatively, if we were to consider the coordinate system of the second camera as the world coordinate system, the disparity between them would be described by a rotation and translation matrix. And intuitively, there is no way to recover the absolute scale of a scene from images. Therefore, the solution of two-view SFM is unique up to a similarity matrix. <figure> <img src="https://hackmd.io/_uploads/SkdcanSsT.png" width="500"/> </figure> ## Factorization-based SFM Factorization method decomposes a set of observed image points into the product of two separate factors: $$ \underbrace{D}_\text{observed points} = \underbrace{M}_\text{motion}~ \underbrace{S}_\text{structure} $$ This method allows processing all observed points across a set of images simultaneously. It helps reduce the effects of noise substantially. :::info Assumption: There exists a set of features on the tracked objects throughout the image sequence. ::: Usually in order to simplify problem, some constraints will be brought in. Next, We will discuss the factorization method with constraint of affine camera model and orthgraphic camera model. Please refer to [camera model post](https://hackmd.io/@jackyyeh/BJxMZtUUT/%2FI2EeEnKSSW6606aA1Gkdcg) if you are not familiar with different types of camera models. ### Factorization-based SFM with Affine Camera Model #### Problem Definition Reformulation Reformulate the projection formula $x_{ij} = M_{i}X_{j}$ using affine projection $M = \begin{bmatrix} A & b \\ 0 & 1 \end{bmatrix}$: :::success Given: 1. $m$ images of $n$ fixed 3D points 2. $mn$ 3D -> 2D correspondences $$ x_{ij} = A_{i}X_{j} + b_{i} ~~~~~~~~~~~~~~~ i=1 \ldots m; ~~ j = 1 \ldots n $$ Estimate: 1. $m$ projection matrices $A_{i}$ 2. $m$ translation vectors $b_{i}$ 3. $n$ 3D points $X_{j}$ ::: #### Algo steps 1. **Center the data by subtracting the centroid of the image points in each view** ![image](https://hackmd.io/_uploads/r1waibN9a.png) <br> After centering, the problem becomes simpler because ==$b_{i}$ is removed. We can focus on solving $A_{i}$==. <br> You may ask why we can place world coorinate origin as we wish. Well, what we really care is relative position of 3D structure and each camera. Therefore, the specific location of the world coordinate origin becomes insignigicant. 2. **Create a measurement matrix** Measurement matrix consists of a set of observed points across the image sequence. $$ \hat{x}_{ij} = A_{i}X_{j} $$ ![image](https://hackmd.io/_uploads/ry1lYQNc6.png) 3. **Factorize $D$** According to linear algebra, we know that for a $m \times n$ matrix: $$ \begin{equation} \tag{1} rank(A) \le min(m, n) \end{equation} $$ and for matrix multiplication: $$ \begin{equation} \tag{2} rank(AB) \le min(rank(A), rank(B)) \end{equation} $$ <br> Based on (1) and (2), we know the measurement matrix $D = MS$ have maximum rank 3. So we can factorize $D$ using SVD and keep top 3 singular values, and discard the remaining. ![image](https://hackmd.io/_uploads/rJUzimNqa.png) 4. **Recover the motion and structure matrices** One of possible solution can be obtained below: $$ M = U\Sigma^{\frac{1}{2}}, ~~~~S = \Sigma^{\frac{1}{2}}V^{T} $$ ![image](https://hackmd.io/_uploads/Bkz46XVcT.png) <br> But the solution is not unique. We can multiply an arbitratry invertible matrix $Q$ in between the equation, deriving another solution $M', S'$: $$ D = MS = (MQ^{-1})(QS) = M'S' $$ #### Reconstruction Ambiguity The solution is not unique. Or sometimes we say the solution is unique up to an affine transformation. ![image](https://hackmd.io/_uploads/HJwr2tsj6.png) ### Factorization-based SFM with Orthographic Camera Model ref: C. Tomasi and T. Kanade [Shape and motion from image streams under orthography: A factorization method.](https://people.eecs.berkeley.edu/~yang/courses/cs294-6/papers/TomasiC_Shape%20and%20motion%20from%20image%20streams%20under%20orthography.pdf) In the previous section, we realized the solution is ambiguous even after the affine camera model assumption is made. To elimiate ambiguity, let us put stronger constraint, switching our camera model to an **orthographic** one. ### Orthographic Camera Model Orthographic camera has ==orthonormal axes== ($A = \begin{bmatrix} a_{1} \\ a_{2} \end{bmatrix}$) like below: ![image](https://hackmd.io/_uploads/HklQA6jo6.png) ### Algo 1. **Obtain a solution** We can obtain a solution from factorization-based SFM with affine camera model: $$ D = MS = \begin{bmatrix} A_{1} \\ A_{2} \\ \vdots \\ A_{m} \end{bmatrix} \begin{bmatrix} X_{1} & X_{2} & \cdots & X_{m} \end{bmatrix} $$ According to last section, we know it can multiply an arbitratry invertible matrix in between the equation: $$ D = MS = MQQ^{-1}S \begin{bmatrix} A_{1}\color{red}{Q} \\ A_{2}\color{red}{Q} \\ \vdots \\ A_{m}\color{red}{Q} \end{bmatrix} \begin{bmatrix} \color{red}{Q^{-1}}X_{1} & \color{red}{Q^{-1}}X_{2} & \cdots & \color{red}{Q^{-1}}X_{m} \end{bmatrix} $$ 2. **Enforce orthographic constraint** Due to we are using orthographic camera right now, $A_{i}Q$ in step 1 represents orthonormal axes: $$ (A_{i}Q)(A_{i}Q)^{T} = A_{i}(QQ^{T})A_{i}^{T} = I_{2\times2} $$ Therefore, each frame can contribute three constraints: $$ \begin{align} a_{1}QQ^{T}a_{1} = 1 \\ a_{2}QQ^{T}a_{2} = 1 \\ a_{1}QQ^{T}a_{2} = 0 \end{align} $$ 3. **Solve Q Matrix** $Q$ is a $3 \times 3$ matrix with 9 variables. According to step 2, each frame can contribute 3 constraints. Therefore, $Q$ can be solved with 3 or more frames. Please refer to this [link](https://math.stackexchange.com/questions/1812669/solving-3x3-matrix-q-using-nonlinear-least-squares-or-cholesky-decomposition?fbclid=IwAR2tS77WB4Rp2wpIYWI2Li_Xfgv2mf58fMO77zw3nfCCKMvfYhRSRTxplF8) for the linear solution details. ### Reconstruction Ambiguity Orthographic camera model can eliminate the affine ambiguity. ## Two-view vs. Factorization-based SFM - In two-view SFM, we compute structure and motion every other two images, which is subject to accumulated noise. While in factorization-based SFM, we compute structure and motion from a series of images at once. - For factorization-based, every point cannot be occluded. - Solution of both methods has ambiguity. ## Bundle Adjustment-based Method Non-linear method for refining structure and motion. Optimize camera motion and 3D structure simultaneously by minimizing reprojection error. $$ Loss(M, X) = \sum\limits_{i=1}^m \sum\limits_{j=1}^n D(x_{ij} - proj(M_{i}X{j}))^{2} $$ <figure> <img src="https://hackmd.io/_uploads/SJsWHCjiT.png" width="400"> </figure> ## References - https://www.cs.cmu.edu/~16385/s19/lectures/lecture12.pdf - https://web.stanford.edu/class/cs231a/course_notes/04-stereo-systems.pdf - http://mrsl.grasp.upenn.edu/loiannog/tutorial_ICRA2016/VO_Tutorial.pdf - [Tomasi-Kanade Factorization | Structure from Motion by Prof. Shree K. Nayar](https://www.youtube.com/watch?v=0BVZDyRrYtQ) - [cs231a camera models course note](https://web.stanford.edu/class/cs231a/course_notes/01-camera-models.pdf) - http://16720.courses.cs.cmu.edu/lec/sfm.pdf