# Visual Odometry

<style>
body > .ui-infobar, body > .ui-toc, body > .ui-affix-toc { display: none !important; }
</style>

<style>
figure { border: 1px #cccccc solid; padding: 4px; margin: auto; }
figcaption { background-color: black; color: white; font-style: italic; padding: 1px; text-align: center; }
</style>

The term "Visual Odometry" was originally coined by analogy with wheel odometry, a method that calculates a vehicle's motion by integrating the rotations of its wheels over time. Similarly, Visual Odometry incrementally estimates the pose of a vehicle by analyzing the changes that its motion induces in the images captured by its onboard cameras.

When I first encountered the term "visual odometry", I got quite confused, especially trying to understand how it differs from VSLAM and SFM. So let us first clarify the relationship between these confusing terms:

<figure style="text-align: center">
<img src="https://hackmd.io/_uploads/Bk2HGkCLT.png" width=500 />
</figure>

Finally I realized that visual odometry is another way to say ==sequential SFM==, thanks to Prof. Davide Scaramuzza's great [slides](https://rpg.ifi.uzh.ch/docs/teaching/2019/09_multiple_view_geometry_3.pdf). The structure of this post is also mainly based on them.

The overall VO pipeline is shown below. This post focuses on untangling the different types of motion estimation algorithms (2D-2D, 3D-3D, 3D-2D), leaving the remaining components to other posts.

<figure style="text-align: center">
<img src="https://hackmd.io/_uploads/BJOicTaDp.png" width=500 />
<figcaption>Visual odometry pipeline</figcaption>
</figure>

### Algo 1: 2D-to-2D / Motion from Image Feature Correspondences

2D-2D motion estimation is essentially the image-pair correspondence problem covered in this post on [epipolar geometry](https://hackmd.io/emuCyxF8QQiGdeZkvmpj3g). Our aim is to estimate the essential matrix between sequential frames ($I_{k-1}$, $I_{k}$). With the essential matrix, we can recover the camera pose by decomposing it into a rotation matrix and a translation vector.

:::success
Algorithm 2D-to-2D steps:
1) Capture new frame $I_{k}$
2) Extract and match features between $I_{k-1}$ and $I_{k}$
3) Compute the essential matrix for the image pair $I_{k-1}$, $I_{k}$
4) Decompose the essential matrix into $R_{k}$ and $t_{k}$
5) Compute the relative scale and rescale $t_{k}$ accordingly
6) Recover the camera pose transformation $T_{k}$ from $R_{k}$ and the rescaled $t_{k}$
7) Repeat from (1)
:::

Note that in step (5), the absolute scale factor can never be known in the monocular scheme; only the relative scale between consecutive transformations can be computed.

<figure style="text-align: center">
<img src="https://hackmd.io/_uploads/B1jly1CwT.png" width=700 />
<figcaption>2D-2D motion estimation</figcaption>
</figure>
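To make steps (1)–(4) concrete, here is a minimal monocular sketch using OpenCV and NumPy. This is my own illustration rather than code from the referenced slides: the function name `relative_pose_2d2d` is a placeholder, `K` (the 3×3 camera intrinsic matrix) must be supplied by you, and ORB features are an arbitrary choice.

```python
import cv2
import numpy as np

def relative_pose_2d2d(img_prev, img_curr, K):
    """Estimate (R_k, t_k) between frames I_{k-1} and I_k; t_k is up to scale."""
    # Step (2): extract and match ORB features between the two frames.
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # Step (3): essential matrix with RANSAC to reject outlier matches.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # Step (4): decompose E; the cheirality check picks the valid (R, t)
    # out of the four possible decompositions.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # ||t|| = 1: the absolute scale is unobservable here
```

The unit-norm `t` returned by `recoverPose` reflects exactly the scale ambiguity noted above; the relative scale in step (5) would be estimated separately, e.g. from distance ratios between triangulated point pairs across frames.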
### Algo 2: 3D-to-2D / Motion from 3D Structure and Image Correspondences

Algo 1 did not use any 3D information. But in fact, as long as corresponding feature pairs are given, we can reconstruct 3D points using the [triangulation](https://hackmd.io/mrBvdst4SyiRvc68LK5qnA) method. Then a [Perspective-n-Point (PnP)](https://hackmd.io/XoqLoirfTHmv0RN7n9e-Cw?both) approach can be applied to recover the camera pose.

:::success
Algorithm 3D-to-2D steps:
1) Do only once (point cloud initialization):
    1-1) Capture two frames $I_{k-2}$ and $I_{k-1}$
    1-2) Extract and match features between $I_{k-2}$ and $I_{k-1}$
    1-3) Triangulate features from $I_{k-2}$ and $I_{k-1}$
2) Do at each iteration:
    2-1) Capture new frame $I_{k}$
    2-2) Extract and match features between $I_{k-1}$ and $I_{k}$
    2-3) Compute camera pose (PnP) from the 3D-to-2D matches
    2-4) Triangulate all new feature matches between $I_{k-1}$ and $I_{k}$
    2-5) Iterate from (2-1)
:::

<figure style="text-align: center">
<img src="https://hackmd.io/_uploads/HJTAHlCwT.png" width=700 />
<figcaption>3D-2D motion estimation</figcaption>
</figure>

### Algo 3: 3D-to-3D / Motion from 3D Structure Correspondences

In this approach, the camera motion $T_{k}$ is computed by finding the transformation that aligns the two 3D feature sets. To do this, we have to triangulate 3D feature points at each time step, so a stereo camera is necessary for this algorithm.

:::success
Algorithm 3D-to-3D steps:
1) Capture two stereo image pairs ($I_{l, k-1}, I_{r, k-1}$) and ($I_{l, k}, I_{r, k}$)
2) Extract and match features between $I_{l, k-1}$ and $I_{l, k}$
3) Triangulate matched features for the stereo pair ($I_{l, k}, I_{r, k}$) to get $X_{k}$
4) Compute $T_{k}$ from the 3D feature sets $X_{k-1}$ and $X_{k}$
5) Repeat from (1)
:::

<figure style="text-align: center">
<img src="https://hackmd.io/_uploads/Hk5rUgADp.png" width=700 />
<figcaption>3D-3D motion estimation</figcaption>
</figure>

## Monocular vs. Binocular (Stereo)

| | Monocular | Binocular |
| -------- | -------- | -------- |
| Computational Efficiency | O | X |
| Scale Ambiguity | O | X |
| 3D Reconstruction within a Step | X | O |
| Robustness | X | O |

(O = yes, X = no)

## References

- [Multiple View Geometry part-3 slides by Prof. Davide Scaramuzza](https://rpg.ifi.uzh.ch/docs/teaching/2019/09_multiple_view_geometry_3.pdf)
- [Visual Odometry tutorial (Part I: The First 30 Years and Fundamentals)](https://rpg.ifi.uzh.ch/docs/VO_Part_I_Scaramuzza.pdf)
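## Appendix: Aligning Two 3D Point Sets (Algo 3, Step 4)

A common closed-form solution for step (4) of the 3D-to-3D algorithm is the SVD-based least-squares alignment (the Arun/Umeyama method). Below is a minimal NumPy sketch under the assumption that `X_prev` ($X_{k-1}$) and `X_curr` ($X_{k}$) are already matched $N \times 3$ point sets; the function name `align_3d3d` is my own placeholder, not from the referenced tutorial.

```python
import numpy as np

def align_3d3d(X_prev, X_curr):
    """Find the rigid transform (R, t) with X_curr ≈ (R @ X_prev.T).T + t."""
    # Center both point sets on their centroids.
    c_prev = X_prev.mean(axis=0)
    c_curr = X_curr.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (X_prev - c_prev).T @ (X_curr - c_curr)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (det = -1) in the SVD solution.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = c_curr - R @ c_prev
    return R, t
```

In practice this estimator is wrapped in a RANSAC loop, since triangulated stereo points inevitably contain outliers.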