[Paper Reading] Depth Anything v1/v2
Abstract
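- Goal
Estimate depth from a single RGB image with a model, i.e. monocular depth estimation (MDE).
- Challenge
Monocular depth estimation needs extra information before it can give a real distance, such as the camera focal length, a known object size, and an initial distance. Only then can the distance from an object to the camera be determined.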
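The role of the focal length and a known object size can be illustrated with the pinhole camera model, where an object of real height H that appears h pixels tall lies at distance Z = f * H / h. The following is a minimal sketch under that assumption, with the focal length given in pixels. The function name and the numbers are hypothetical and are not taken from the paper.

```python
def distance_from_known_size(focal_length_px: float,
                             real_height_m: float,
                             pixel_height: float) -> float:
    """Pinhole-camera estimate of distance: Z = f * H / h.

    focal_length_px: focal length expressed in pixels (assumed known).
    real_height_m:   true height of the object in metres (assumed known).
    pixel_height:    height of the object in the image, in pixels.
    """
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    return focal_length_px * real_height_m / pixel_height


# Hypothetical numbers: f = 1000 px, a 1.5 m tall object that spans 300 px.
distance_m = distance_from_known_size(1000.0, 1.5, 300.0)
print(distance_m)  # 5.0
```

Without the focal length or the true size, the same 300 px could belong to a small near object or a large far one, which is why a single image alone only gives relative depth.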
Three key contributions
- Replace all labeled real images with synthetic images for training
- Scale up the capacity of the teacher model
- Train student models on a large number of pseudo-labeled images
Results
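Figure 2 shows the model's behavior on two kinds of images:
- Misleading images
- Images with a large amount of fine detail
Preferable Properties
Qualitative properties considered when evaluating a model, i.e. aspects that are judged without a numeric metric.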
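- Fine Detail
Details such as textures and edges
- Transparent Object
Objects that refract, reflect, or transmit light
- Reflections
Reflective objects and mirror scenes
- Complex Scenes
Scenes with multiple layers and many objects
- Efficiency
Model performance and runtime
- Transferability
Whether the model can be transferred to different application scenarios and devices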
Introduction
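Highlights the importance of data.
Properties a depth estimation model should have
- Produce robust predictions for complex scenes, including but not limited to complex layouts, transparent objects, and reflective surfaces (mirrors, screens)
- Contain fine details in the predicted depth map, including but not limited to thin objects and small holes
- Come in different model sizes and inference speeds to support a wide range of applications
- Be general enough as a pretrained model that its features transfer effectively to downstream tasks such as classification and detection
Note
"Including but not limited to" means the listed items are not strictly required (a bit of a word game).
Three questions and answers
- Q: Does the coarse depth estimated by MiDaS come from the discriminative model itself? Is a diffusion-based approach required to get fine details?
- A: No. Discriminative models can still produce fine details; the key is to replace all labeled real images with precise synthetic images.
- Q: If synthetic images are already better than labeled real images, why did earlier training approaches keep using real images?
- A: Synthetic images have their own drawbacks, and these were hard to overcome under earlier learning paradigms.
- Q: How can the problems of synthetic images be avoided while amplifying their advantages?
- A: Synthetic images are used only to train the teacher model; smaller student models are then trained on large-scale pseudo-labeled real images.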
Revisiting the Labeled Data Design of Depth Anything V1
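This section points out that earlier work tried to build large sets of labeled real images for training, but overlooked the problems with real labels.
Problems with real labels
- Depth sensors cannot reliably capture the depth of transparent objects (Figure 3a)
- Matching algorithms break down on repetitive patterns (Figure 3b)
- SfM methods mislabel dynamic objects (Figure 3c)
- Object details come out too coarse in the trained model's predictions.
The model cannot learn depth features for edges and fine details, so its predictions are over-smoothed.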
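Advantages of synthetic images
- All details (textures, edges, small holes, small objects) are labeled correctly, because the labels are generated by the rendering engine
- They reflect the actual depth of transparent objects and reflective surfaces
Figure 4 compares a model trained on real images with one trained on synthetic images. The model trained on synthetic images produces noticeably better fine details.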
Challenges in Using Synthetic Data
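The previous section covered the benefits of synthetic images, but some problems remain.
Limitations
- There is a distribution shift between synthetic and real images. Even though current simulators can render photorealistic images, they still do not match real ones: synthetic images are too clean and too ordered, which makes it hard to transfer from the synthetic data distribution to the real one, so some bias always remains.
- Synthetic images cover a restricted range of scenes
- Synthetic examples: Apple Hypersim, Virtual KITTI
- Real datasets with broader scene coverage: web stereo images (HRWSI), the MegaDepth dataset
Experiments show that training a depth estimation model on synthetic data alone does not work well. This holds even with larger models, and a gap in fine detail remains.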
Key Role of Large-Scale Unlabeled Real Images
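Solution: bring in unlabeled real-world images, with their labels generated by a large model.
- Bridge the domain gap.
- Because of the domain gap, first train a large model on synthetic datasets, then run real images through it to produce pseudo-labeled real images, and train a small model on those. This lets the small model learn the real-world distribution better.
- Enhance the scene coverage.
- Synthetic data has limited diversity, so more real-world images are needed to fill the gap
- Feed more public datasets into the large model to produce pseudo-labeled real images
- Transfer knowledge from the most capable model to smaller ones.
- Similar to knowledge distillation, but done at the data level rather than on the output probability distribution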
Depth Anything V2
Framework
- Train a large model on a large amount of effective synthetic data
- Use the large model to produce precise pseudo-labeled real images
- Train student models on the large set of pseudo-labeled real images
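The data flow of these three steps can be sketched with a toy example. This is only an illustration under strong assumptions: "training" is replaced by a one-dimensional least-squares line fit, the teacher and student share the same tiny model class, and all names and numbers (`fit_line`, `synthetic_labeled`, `unlabeled_real`) are hypothetical. It is not the paper's training code.

```python
from typing import Callable, List, Tuple

Model = Callable[[float], float]


def fit_line(pairs: List[Tuple[float, float]]) -> Model:
    """Least-squares fit of y = a * x + b; a stand-in for 'training a model'."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return lambda x: a * x + b


# Step 1: train the teacher on synthetic data, where every label is exact.
synthetic_labeled = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # toy rule: depth = 2x + 1
teacher = fit_line(synthetic_labeled)

# Step 2: the teacher assigns pseudo-labels to unlabeled real inputs.
unlabeled_real = [0.5, 1.5, 2.5, 3.5]
pseudo_labeled = [(x, teacher(x)) for x in unlabeled_real]

# Step 3: the student is trained only on the pseudo-labeled real data.
student = fit_line(pseudo_labeled)

print(student(4.0))
```

The point of the sketch is that the student never sees the synthetic data directly. Everything it learns arrives through the labels the teacher wrote onto real inputs.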
Essentially, a great deal of the effort goes into the dataset.
Detail
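- During training, the regions with the top-n largest loss values are ignored, where n is a preset value; these excessively high-loss regions are treated as potentially noisy pseudo-labels
- This helps considerably with the sharpness of the predicted depth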
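Ignoring the largest losses can be sketched as follows. This is a minimal illustration that assumes the per-pixel losses have already been computed and flattened into a list. The function name `loss_ignoring_largest`, the parameter `drop_ratio`, and the value 0.1 used in the example are assumptions made for illustration; these notes do not state the actual setting.

```python
from typing import List


def loss_ignoring_largest(per_pixel_loss: List[float], drop_ratio: float) -> float:
    """Mean loss after discarding the largest `drop_ratio` fraction of values.

    The discarded entries play the role of 'potentially noisy pseudo-labels'.
    """
    if not per_pixel_loss:
        raise ValueError("per_pixel_loss must not be empty")
    if not 0.0 <= drop_ratio < 1.0:
        raise ValueError("drop_ratio must be in [0, 1)")
    n_drop = int(len(per_pixel_loss) * drop_ratio)
    # Sort ascending and keep everything except the n_drop largest values.
    kept = sorted(per_pixel_loss)[: len(per_pixel_loss) - n_drop]
    return sum(kept) / len(kept)


# Hypothetical losses: nine ordinary pixels and one outlier from a bad pseudo-label.
losses = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.3, 0.1, 9.0]
full_mean = loss_ignoring_largest(losses, 0.0)
trimmed_mean = loss_ignoring_largest(losses, 0.1)  # illustrative ratio only
print(full_mean, trimmed_mean)
```

With the outlier dropped, the average is no longer dominated by a single unreliable region, so the student's gradient comes mainly from pixels where the pseudo-label is plausible.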