論文筆記 Direct Sparse Odometry

# Direct Sparse Odometry J. Engel, V. Koltun, D. Cremers., *Direct sparse odometry, IEEE transactions on pattern analysis and machine intelligence*, vol. 40, pp. 611-625, 2017. [TOC] ## Abstract >fully direct VO, backend nonlinear optimization on pose & depth We propose a **direct sparse** VO, it combines a fully direct probabilistic model (minimizing a photometric error), joint **optimization** geometry (represented as **inverse depth** in a reference frame) and **camera motion**. Since our method does not depend on keypoint detectors or descriptors, it can naturally sample pixels from across all image regions that have intensity gradient, including edges or smooth intensity variations on mostly white walls. The proposed model integrates a full photometric calibration, accounting for exposure time, lens vignetting, and non-linear response functions. - sparse + indirect → ORB-SLAM - dense + indirect - dense + direct - sparse + direct → DSO ## Back-end DSO is based on continuous optimization of the photometric error, windowed sparse bundle adjustment for all parameters: camera intrinsics, camera extrinsics, inverse depth values. 3D points are represented as inverse depth in a reference frame and thus have 1 DoF. ### Notations - bold upper case $\mathbf{H}$ → matrix - bold lower case $\mathbf{x}$ → vector - light upper case $I$ → function, image - light lower case $t$ → scaler - transformation matrix $\mathbf{T}_i\in\text{SE}(3)$ → pose ### Calibration **Geometric camera calibration** pin-hole → fish eye lens **Photometric calibration** non-linear response function: $G:\mathbb{R}\rightarrow[0,255]$ lens attenuation(vignetting): $V:\Omega\rightarrow[0,1]$ $$ I_i(\mathbf{x})=G\big(t_iV(\mathbf{x})B_i(\mathbf{x})\big) $$ $B_i$: irradiance image, $t_i$: exposure time $$ I_i^\prime(\mathbf{x})=t_iB_i(\mathbf{x})=\frac{G^{-1}\big(I_i(\mathbf{x})\big)}{V(\mathbf{x})} $$ ### Photometric error >The photometric error of a point $\mathbf{p} ∈ Ω_i$ in reference frame $I_i$, observed in a target frame $I_j$, is the weighted Sum of Squared Differences over a neighborhood of pixels. **weighted photometric energy term** $$ E_{\mathbf{p}j}=\sum_{\mathbf{p}\in {N_\mathbf{p}}}w_{ \mathbf{p}}\big|\big|(I_j[\mathbf{p}^\prime]-b_j)-\frac{t_je^{a_j}}{t_ie^{a_j}}(I_i[\mathbf{p}]-b_i)\big|\big|_\gamma $$ $a,b$: affine brightness parameters, $t_i$: exposure time, $||\cdot||_\gamma$: Hubert norm $$ \mathbf{p}^\prime=\Pi_\mathbf{c}\big(\mathbf{R}\Pi_\mathbf{c}^{-1}(\mathbf{p},d_\mathbf{p})+\mathbf{t}\big),\ \text{ where }\begin{bmatrix}\mathbf{R}&\mathbf{t}\\0&1\end{bmatrix}=\mathbf{T}_j\mathbf{T}_i^{-1} $$ $\mathbf{p}^\prime$ → $\mathbf{p}$ 經過 $\Pi_\mathbf{c}^{-1}$ 轉到世界座標，過 relative pose 再經過 $\Pi_\mathbf{c}$ 轉回相機座標 $$ w_{\mathbf{p}}=\frac{c^2}{c^2+||\nabla I_i(\mathbf{p})||_2^2} $$ gradient-dependent weighting: $||\nabla I_i(\mathbf{p})||_2^2$ 越大 $w_{\mathbf{p}}$ 越小，反之亦然 **photometric error** \begin{equation} E_{\text{photo}}=\sum_{i\in F}\sum_{\mathbf{p}\in P_i}\sum_{j\in\text{obs}(\mathbf{p})}E_{\mathbf{p}j}\tag{1} \end{equation} $i$ runs over all frames $F$ , $\mathbf{p}$ over all points $P_i$ in frame $i$, and $j$ over all frames $\text{obs}(\mathbf{p})$ in which $\mathbf{p}$ is visible ### Factor graph ![image](https://hackmd.io/_uploads/BkGIAbkH6.png =500x) 四個點：$\mathbf{p}_1,\mathbf{p}_2,\mathbf{p}_3,\mathbf{p}_4$ $E_{\mathbf{p_i}j}$: $\mathbf{p}_i$ 被 $j_{\text{th}}$ frame 觀測到 keyframe(前端)提供 $\mathbf{T},a,b,d$ KF1提供點$\mathbf{p}_1$，他被KF2、KF3觀測到，因此可以建構兩個node $E_{\mathbf{p_1}2}$($\mathbf{T}_1,a_1,b_1,\mathbf{T}_2,a_2,b_2,d_1$)、$E_{\mathbf{p_1}3}$($\mathbf{T}_1,a_1,b_1,\mathbf{T}_3,a_3,b_3,d_1$) least squares：$\sum E_{\mathbf{p_i}j}$ → $E_{\text{photo}}$ >DSO並不在意點之間的對應關係 ### Windowed optimization optimize the total error Equation $(1)$ in a **sliding window** using the Gauss-Newton $$ \mathbf{H}=\mathbf{J}^T\mathbf{W}\mathbf{J}\ \text{ and }\ \mathbf{b}=-\mathbf{J}^T\mathbf{Wr}\implies \mathbf{\delta}=\mathbf{H}^{-1}\mathbf{b},\ \mathbf{\tilde x}\leftarrow\mathbf{x}+\mathbf{\delta} $$ where $\mathbf{W}\in\mathbb{R}^{n\times n}$ is the diagonal matrix containing weights, $\mathbf{r}\in\mathbb{R}^n$ is the residual vector, $\mathbf{J}\in\mathbb{R}^{n\times d}$ is the Jacobian of $\mathbf{r}$ > remark weight here: > $$ w_{\mathbf{p}}=\frac{c^2}{c^2+||\nabla I_i(\mathbf{p})||_2^2} $$ > remark D3VO: > $$ w_{\mathbf{p}}=\frac{c^2}{c^2+||\tilde\Sigma(\mathbf{p})||_2^2} $$ >$w_p$ is the information matrix in least squares, D3VO incorporates uncertainty $\tilde\Sigma$ with weighted photometric energy **Marginalization** When the active set of variables becomes too large, old variables are removed by marginalization using the Schur complement. ## Front-end - determines the sets $F,P_i,\text{obs}(\mathbf{p})$ that make up the error terms of $E_{\text{photo}}$ - provides initializations for new parameters, required for optimizing the highly non-convex energy function $E_{\text{photo}}$ (The linearization of the image $I$ is only valid in a 1-2 pixel radius; hence all parameters involved in computing $\mathbf{p}^\prime$ should be initialized sufficiently accurately for $\mathbf{p}^\prime$ to be off by no more than 1-2 pixels.) - decides when a point / frame should be marginalized ![image](https://hackmd.io/_uploads/S1gFCJMyr6.png =600x) ### Frame management window up to $N_f=7$ frames ``` (step 1) → discard or create new keyframe if new keyframe: (step 2) → new photometric error optimized (step 3) → marginalize one or more frames ``` **(step 1) initial frame tracking** down sampling **(step 2) keyframe creation** 1. field of view change $$ f =\Big(\frac{1}{n}\sum_{i=1}^n||\mathbf{p}-\mathbf{p}^\prime||^2\Big)^{\frac{1}{2}} $$ 2. translation cause occlusion/disocculusion (遮擋/非遮擋) $$ f_t=\Big(\frac{1}{n}\sum_{i=1}^n||\mathbf{p}-\mathbf{p}^\prime_t||^2\Big)^{\frac{1}{2}},\ \text{ where }\mathbf{p}^\prime_t\text{ is warped with }\mathbf{R}=\mathbf{I}_{3\times3} $$ 3. exposure time change $$ a=|\log(e^{a_j-a_it_jt_i^{-1}})| $$ a new keyframe is taken if $w_ff+w_{f_t}f_t+w_aa>T_{\text{keyframe}}=1$ **(step 3) keyframe marginalization** let $I_1$ be the newest keyframe and $I_n$ be the oldest keyframe 1. always keep $I_1$ and $I_2$ 2. if more than $N_f=7$ frames are active, marginalize $I_m,\ m\in[3,n]$ with the highest distance score $s(I_m)$ $$ s(I_i)=\sqrt{d(i,1)}\sum_{j\in[3,n]-\{i\}}\big(d(i,j)+\epsilon\big)^{-1} $$ where $d(i,j)$ is the Euclidean distance between $I_i,I_j$ and $\epsilon$ is a small constant This scoring function is heuristically designed to keep active keyframes well-distributed with more keyframes close to the most recent one. A keyframe is marginalized by first marginalizing all points represented in it, and then the frame itself. To preserve the sparsity, all observations of still existing points in the frame are dropped from the system. ### Point management always keep a fixed number $N_p = 2000$ of active points, equally distributed across space and active frames. ``` 首先在N_f=7的keyframe，每幀都取N_p=2000個點 ** coarse depth value ** 在N_f*N_p=14000個點再篩選N_p=2000個點 if new keyframe: (step 1) → candidate point selection (取N_p=2000個點) (step 2) → candidate point tracking (coarse depth value) (step 3) → candidate point activate ``` **(step 1) candidate point selection** (use raw image) goal: (1) well distributed in image (2) high image gradient magnitude (edge detection) **threshold** split image to 32 × 32 blocks. For each block, compute the threshold as $\bar{g} + g_{\text{th}}$, where $\bar{g}$ is the median absolute gradient over all pixels in that block and $g_{\text{th}}=7$ **select** split image to d × d blocks, from each block select the pixel with largest gradient if it surpasses threshold **repeat** → capture weaker gradient points, decrease threshold with 2d x 2d and 4d x 4d **(step 2) candidate point tracking** point candidates are tracked in subsequent frames using a discrete search along the epipolar line, minimizing the photometric error $E_{\mathbf{p}j}$ (LSD-SLAM) **(step 3) candidate point activate** (after old points are marginalized) (1) project all existing points onto the most recent keyframe (2) project candidate points into the most recent keyframe (3) activate candidate points which maximize the distance to any existing point ## Open Source https://github.com/JakobEngel/dso