# Unsupervised Monocular Depth Estimation with Left-Right Consistency

## [Github](https://github.com/mrharicot/monodepth#i-just-want-to-try-it-on-an-image)

- my test
![](https://i.imgur.com/P4emgow.jpg)
![](https://i.imgur.com/3rbrGH9.png)
- my test 2
![](https://i.imgur.com/o0vCAvN.jpg)
![](https://i.imgur.com/ouabutC.png)

---

- [UCL Team Project Web](http://visual.cs.ucl.ac.uk/pubs/monoDepth/)
- The slides below are captured from the original authors' talk video; source: [CVPR 2017 Talk](https://www.youtube.com/watch?v=jI1Qf7zMeIs)

1 ![](https://i.imgur.com/mGLSqa6.png)
2 ![](https://i.imgur.com/lQRRjbJ.png)

### Why mono

3 ![](https://i.imgur.com/W9JNDCJ.png)
4 ![](https://i.imgur.com/AZocxzF.png)
5 ![](https://i.imgur.com/860YPii.png)
6 ![](https://i.imgur.com/U2F8IMB.png)
7 ![](https://i.imgur.com/YknqjRO.png)
8 ![](https://i.imgur.com/xQGsaZF.png)

### IR

9 ![](https://i.imgur.com/4uvtV1k.png)
10 ![](https://i.imgur.com/ujus4Q4.png)
11 ![](https://i.imgur.com/lU1g7RY.png)
12 ![](https://i.imgur.com/tB5qU3v.png)
13 ![](https://i.imgur.com/W8VWVHp.png)
14 ![](https://i.imgur.com/SJZYkOa.png)
15 ![](https://i.imgur.com/7Hs11JS.png)
16 ![](https://i.imgur.com/995CvD2.png)

### Key

17 ![](https://i.imgur.com/N13T6bO.png)

### Q: How does the CNN produce the L/R disparities?

18 ![](https://i.imgur.com/LJhfUNZ.png)
19 ![](https://i.imgur.com/VFErzqW.png)
20 ![](https://i.imgur.com/VnI0sOy.png)
21 ![](https://i.imgur.com/6vZmACt.png)
22 ![](https://i.imgur.com/mkXMcaB.png)
23 ![](https://i.imgur.com/grC8nku.png)
24 ![](https://i.imgur.com/5x9Fmh6.png)

### Input

25 ![](https://i.imgur.com/Dle0Gic.png)
26 ![](https://i.imgur.com/Tg0YYRp.png)
27 ![](https://i.imgur.com/eYCESvL.png)
28 ![](https://i.imgur.com/SU3Fy3c.png)
29 ![](https://i.imgur.com/PI1YZUf.png)
30 ![](https://i.imgur.com/NXHlhmZ.png)

---

## Motivation

- We want to estimate monocular depth with a CNN.
- Ground-truth depth for training is hard to collect.
- Can we learn monocular depth estimation from stereo image pairs instead?

## Idea

- Do not estimate the depth map directly; estimate a disparity map.
- Disparity map: for each pixel in the left (right) image, how far away is the corresponding pixel in the right (left) image?
- In images taken with two parallel cameras, objects close to the cameras show a large left-right position difference, and the disparity of a point at infinity is 0.
- Actual depth = scale factor / disparity (see the sketch after this section).

Fang raises a question: suppose video A has scale factor Sa and video B has scale factor Sb. If the ground truth comes from A, predictions on video B are poor, most likely because Sa ≠ Sb. Is there a normalization method that would improve predictions on video B?

![](https://i.imgur.com/4GviKG7.png)
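The scale factor in the formula above is $bf$ (camera baseline × focal length), which also suggests one reading of Fang's question: $bf$ is rig-specific, so a model calibrated against one rig's scale transfers poorly to another. Below is a minimal NumPy sketch of the conversion, my own illustration rather than code from the paper's repo; the baseline/focal numbers are KITTI-like values, and predicting disparity as a fraction of image width is an assumption about a common parameterization, not something stated in these notes.

```python
import numpy as np

# Minimal sketch of depth = b * f / d (illustration only, not from the
# monodepth repo). Baseline/focal values are KITTI-like numbers.

def disparity_to_depth(disp_px, baseline_m, focal_px, eps=1e-6):
    """Convert a per-pixel disparity map (in pixels) to depth in meters."""
    return baseline_m * focal_px / np.maximum(disp_px, eps)

# Assumption: the network predicts disparity as a fraction of image width,
# so it must be rescaled to pixels before the formula applies.
H, W = 256, 512
disp_frac = np.random.uniform(0.01, 0.3, size=(H, W))  # stand-in prediction
disp_px = disp_frac * W

depth = disparity_to_depth(disp_px, baseline_m=0.54, focal_px=721.0)
print(depth.min(), depth.max())  # large disparity -> small depth
```

Even if two videos use the same fractional-disparity convention, different $b$ and $f$ map the same fraction to different metric depths, so some rig-aware normalization (or per-rig rescaling by $bf$) would be needed.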
## Processing flow

- Generate a disparity map from the left image.
- Synthesize the opposite image from the original image and the disparity map.
- Use the difference between the original image and the synthesized image as the loss.

![](https://i.imgur.com/YRT1VHX.png)

More concretely: at training time, a rectified stereo camera captures a pair of images simultaneously. Let $I^l$ and $I^r$ denote the left and right images. We try to find a dense correspondence field $d^r$ such that $d^r$ applied to $I^l$ reconstructs $I^r$; we write $I^l(d^r)$ as $\widetilde{I}^r$ (the reconstructed right image). Similarly, $I^r(d^l)$, written $\widetilde{I}^l$, reconstructs the left image.

* Assuming all images are **rectified**, $d$ is the image disparity, and our model learns to predict one scalar per pixel.
* Given the baseline distance $b$ between the cameras and the camera focal length $f$, we recover the depth as $\hat{d} = bf/d$.

- [deconvolutional](https://datascience.stackexchange.com/questions/6107/what-are-deconvolutional-layers) [conv_arithmetic](https://github.com/vdumoulin/conv_arithmetic)
- ![](https://i.imgur.com/lFOdsHT.png)

## How to make the opposite image

- Create the left (right) image from the right (left) image and the left (right) disparity map.
- This operation must be differentiable.
- Use a bilinear sampler. [Spatial Transformer Networks](https://arxiv.org/abs/1506.02025)

## Bilinear Sampler

![](https://i.imgur.com/3GA08K8.png)

- Pixel $L[x, y]$ of the left image corresponds to $R[x + disp_L[x, y], y]$ in the right image.
- That coordinate is generally not an integer: it is an internal division point between $R[\lfloor x + disp_L[x, y] \rfloor, y]$ and $R[\lceil x + disp_L[x, y] \rceil, y]$. The sampled value is the weighted sum of those two pixels, with weights given by the division ratio $(disp_L[x, y] - \lfloor disp_L[x, y] \rfloor) : (\lceil disp_L[x, y] \rceil - disp_L[x, y])$. This operation is differentiable.
- Problem: the gradient signal comes from differences between neighbouring pixels, so the sampler mainly teaches the model where pixel differences are large → introduce the Disparity Smoothness Loss (described later). A code sketch of the sampler follows the figures below.

![](https://i.imgur.com/Hsx6aoX.png)
![](https://i.imgur.com/wVHw3Os.png)
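To make the warp concrete, here is a minimal NumPy sketch of the horizontal bilinear sampler described above, using this page's sign convention $L[x, y] \leftrightarrow R[x + disp_L[x, y], y]$ and a single-channel image. It is an illustration, not the authors' TensorFlow implementation, and the function name is my own.

```python
import numpy as np

def sample_right_for_left(right, disp_left):
    """Reconstruct the left image by bilinearly sampling the right image at
    x + disp_L[x, y] along each row (rectified images, so only the horizontal
    coordinate moves). `right` and `disp_left` are (H, W) arrays."""
    H, W = right.shape
    xs = np.arange(W, dtype=np.float64)[None, :]  # (1, W) column indices
    sample_x = xs + disp_left                     # where to read in the right image

    x0 = np.floor(sample_x).astype(int)           # floor neighbour
    frac = sample_x - x0                          # weight of the ceil neighbour
    x1 = np.clip(x0 + 1, 0, W - 1)                # ceil neighbour
    x0 = np.clip(x0, 0, W - 1)

    rows = np.arange(H)[:, None]                  # (H, 1) row indices
    # Weighted sum of the floor/ceil pixels: piecewise linear in the
    # disparity, hence differentiable (almost everywhere) w.r.t. it.
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]

# Sanity check: zero disparity reproduces the right image exactly.
img = np.random.rand(4, 8)
assert np.allclose(sample_right_for_left(img, np.zeros_like(img)), img)
```

In a deep-learning framework the same index-and-blend pattern is written with gather-style ops so that gradients flow back into the disparity map, which is the whole point of using a bilinear sampler instead of a nearest-neighbour lookup.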
## Loss

- Appearance Matching Loss: the difference between the synthesized image and the original image.
- Disparity Smoothness Loss: assumes disparities of nearby pixels are close.
- Left-Right Disparity Consistency Loss: assumes the left and right disparity maps agree with each other.

A NumPy sketch of all three losses appears at the end of these notes.

### Appearance Matching Loss

![](https://i.imgur.com/cAZt2I0.png)

[SSIM](https://zh.wikipedia.org/wiki/%E7%B5%90%E6%A7%8B%E7%9B%B8%E4%BC%BC%E6%80%A7): the structural similarity index (SSIM) measures how similar two digital images are. When one image is distortion-free and the other is a distorted version of it, their structural similarity can be read as a quality measure of the distorted image. Compared with traditional quality metrics such as peak signal-to-noise ratio (PSNR), SSIM agrees better with human judgments of image quality.

[SSIM_paper](http://www.cns.nyu.edu/~zwang/files/papers/ssim.pdf)

### Disparity Smoothness Loss

![](https://i.imgur.com/c8wzlcZ.png)

- $\partial d$: keeps the disparity locally smooth.
- $\partial I$: down-weights the penalty at image gradients, since depth discontinuities often occur at image edges.
- The disparities of vertically and horizontally neighbouring pixels should not differ too much; changes should be gradual.

### Left-Right Disparity Consistency Loss

![](https://i.imgur.com/Or4LqPV.png)

- Synthesize the right (left) disparity map from the left (right) one, and require the two to match.

---

### Introduction

Understanding the shape of a scene from a single image, independent of its appearance. Monocular: using a single camera.

Applications include:

1. synthetic object insertion in computer graphics
2. synthetic depth of field in computational photography
3. grasping in robotics
4. robot assisted surgery
5. automatic 2D to 3D conversion in film
6. self-driving cars

* Humans are good at monocular depth estimation.
* Approach: treat automatic depth estimation as an image reconstruction problem during training.
* What this method does not require...
* What this method improves...
* How this method performs...
* Other methods...

![](https://i.imgur.com/Du60k9d.png)

[disparity (視差)](http://blog.csdn.net/chentravelling/article/details/53671279)

### Related Work

Approaches to depth estimation from images:

- using stereo pairs
- several overlapping images captured from different viewpoints
- temporal sequences
- assuming a fixed camera, static scene, and changing lighting

These methods usually require more than one input image of the region of interest, whereas what we want is monocular depth estimation from a single input image.

#### Background

- Image Rectification (Stereo)
  * Epipolar geometry ![](https://i.imgur.com/Yp8VRlt.png)
  * When the same object is found in both the left and right images, its offset along the epipolar line (from the blue box in the left image to the blue box in the right image, measured in pixels) is called the disparity; its inverse gives the depth. ![](https://i.imgur.com/2xyic7r.png)

## Learning-Based Stereo

- Typically the stereo pair is rectified, so disparity (i.e. scaled inverse depth) estimation can be posed as a 1D search problem for each pixel. Recently it has been shown that, instead of using hand-defined similarity measures, treating the matching as a supervised learning problem and training a function to predict the correspondences produces far superior results.
- Mayer et al. [39] introduced a fully convolutional [47] deep network called DispNet that directly computes the correspondence field between two images.
- This requires large amounts of accurate ground-truth disparity data and stereo image pairs.
- Synthetic data is one way to obtain such ground truth at scale.

## Supervised Single Image Depth Estimation

- Only a single image is available at test time!
- Patch-based models such as Make3D:
  * have difficulty modeling thin structures and, as predictions are made locally, lack the global context required to generate realistic outputs.
- Eigen et al. [10, 9] showed that it was possible to produce dense pixel depth estimates using a two-scale deep network trained on images and their corresponding depth values.
- CRFs have also been used to refine the predictions.

## Unsupervised Depth Estimation

- Flynn et al. [13] introduced a novel image synthesis network called DeepStereo that generates new views by selecting pixels from nearby images.

### Method

The paper introduces a novel depth estimation training loss that lets us train on pairs of images without ground-truth depth.

#### 3.1 Depth Estimation as Image Reconstruction

* Given a single image $I$, we want to learn a function $f$ that predicts the per-pixel depth $\hat{d} = f(I)$.
* Many methods use supervised learning, which is impractical (ground-truth depth is hard to collect) and often inaccurate.
* Our method is based on the following intuition: given a rectified stereo pair, if we can learn a function that reconstructs one image from the other, we have learned something about the 3D structure of the scene.
* The formulation is the one given under Processing flow above: find dense correspondence fields $d^r$ and $d^l$ so that $\widetilde{I}^r = I^l(d^r)$ and $\widetilde{I}^l = I^r(d^l)$; with rectified images, $d$ is the per-pixel disparity, and depth is recovered as $\hat{d} = bf/d$.

#### 3.2 Depth Estimation Network

![](https://i.imgur.com/e7M6sgz.png)

- The key insight of our method is that we can simultaneously infer both disparities (left-to-right and right-to-left), using only the left input image, and obtain better depths by enforcing them to be consistent with each other.
- Why does training not require ground truth? Because the other image of the stereo pair supervises the reconstruction.
- Architecture: DispNet with modifications.

---

## Overview

[知乎](https://www.zhihu.com/question/53354718/answer/207687177)

Five families of monocular depth estimation methods:

- Category 1: relies purely on deep learning and network architecture.
- Category 2: relies on properties of depth information itself.
- Category 3: CRF-based methods.
- Category 4: based on relative depth.
- Category 5: unsupervised learning.
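To tie the three losses above together, here is a minimal NumPy sketch of them (appearance matching, edge-aware disparity smoothness, left-right consistency). It is an illustration under simplifying assumptions, not the authors' TensorFlow implementation: SSIM is computed from global image statistics rather than the paper's local 3×3 windows, the left-right check uses nearest-neighbour projection instead of bilinear sampling, images are single-channel, and the function names are my own; $\alpha = 0.85$ is the SSIM/L1 mixing weight reported in the paper, while the term weights in the usage example are merely illustrative.

```python
import numpy as np

ALPHA = 0.85  # SSIM/L1 mixing weight reported in the paper

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM from global statistics (the paper uses 3x3 windows)."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2))

def appearance_loss(img, recon):
    """Appearance Matching Loss: weighted SSIM + L1 photometric error."""
    ssim_term = (1.0 - ssim_global(img, recon)) / 2.0
    return ALPHA * ssim_term + (1.0 - ALPHA) * np.mean(np.abs(img - recon))

def smoothness_loss(disp, img):
    """Disparity Smoothness Loss: disparity gradients, down-weighted by
    exp(-|image gradient|) so depth edges are allowed at image edges."""
    dx_d = np.abs(disp[:, 1:] - disp[:, :-1])
    dy_d = np.abs(disp[1:, :] - disp[:-1, :])
    dx_i = np.abs(img[:, 1:] - img[:, :-1])
    dy_i = np.abs(img[1:, :] - img[:-1, :])
    return np.mean(dx_d * np.exp(-dx_i)) + np.mean(dy_d * np.exp(-dy_i))

def lr_consistency_loss(disp_l, disp_r):
    """Left-Right Consistency: |d_l(x) - d_r(x + d_l(x))|, using a
    nearest-neighbour projection for brevity."""
    H, W = disp_l.shape
    xs = np.arange(W)[None, :]
    sample_x = np.clip(np.round(xs + disp_l).astype(int), 0, W - 1)
    rows = np.arange(H)[:, None]
    return np.mean(np.abs(disp_l - disp_r[rows, sample_x]))

# Hypothetical usage with stand-in data; the term weights are illustrative.
H, W = 64, 96
img_l = np.random.rand(H, W)
recon_l = img_l + 0.01 * np.random.randn(H, W)
disp_l = np.full((H, W), 4.0)
disp_r = np.full((H, W), 4.0)
total = (appearance_loss(img_l, recon_l)
         + 0.1 * smoothness_loss(disp_l, img_l)
         + 1.0 * lr_consistency_loss(disp_l, disp_r))
print(total)
```

In the paper, each term is evaluated for both the left and right views and at each of the four output scales of the network, and the per-view, per-scale sums form the total training loss shown in the loss figures above.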