[Paper Reading] Depth Anything V1/V2

Abstract

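  • Goal
    Monocular depth estimation (MDE): a model estimates depth from an RGB image taken by a single camera
  • Challenge
    Monocular depth estimation needs extra conditions to work, such as the camera focal length, a known object size, and an initial distance, before the distance from an object to the camera can be determined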
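
To make the challenge concrete, the snippet below is a minimal sketch of the classic pinhole-camera relation, which recovers distance only when the focal length and the real object size are known. The function name and the example numbers are hypothetical and do not come from the paper.

```python
def distance_from_known_size(focal_length_px: float,
                             real_height_m: float,
                             pixel_height: float) -> float:
    """Pinhole model: distance = focal length * real height / height in the image."""
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    return focal_length_px * real_height_m / pixel_height


# Hypothetical example: a 1.7 m tall person spanning 340 px,
# seen by a camera whose focal length is 1000 px.
print(distance_from_known_size(1000.0, 1.7, 340.0))
```

Without the focal length or the real height, the same 340 px could correspond to many different distances, which is why a learned MDE model is attractive.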

Three Main Contributions

  1. Replace all labeled real images with synthetic images for training
  2. Scale up the capacity (parameter count) of the teacher model
  3. Train student models on large-scale pseudo-labeled images

Results

Figure 2 shows the model's behavior on two kinds of input:

  1. Misleading images
  2. Images with a large amount of fine detail

Preferable Properties

Qualitative (non-quantified) properties that are considered when evaluating a model

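  1. Fine Detail
    Fine structures such as textures and edges
  2. Transparent Objects
    Objects that refract, reflect, or transmit light
  3. Reflections
    Reflective objects and mirror-like scenes
  4. Complex Scenes
    Scenes with multiple layers and many objects
  5. Efficiency
    Model performance and inference time
  6. Transferability
    Whether the model can be transferred to different application scenarios and devices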

Introduction

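The introduction highlights the importance of data.

Properties a Depth Estimation Model Should Have

  1. Produce robust predictions for complex scenes, including but not limited to complex layouts, transparent objects, and reflective surfaces (mirrors, screens)
  2. Contain fine details in the predicted depth map, including but not limited to thin objects and small holes
  3. Come in different model sizes and inference speeds to support a wide range of applications
  4. Be general enough as a pre-trained model that its features transfer effectively to downstream tasks, including classification, detection, and so on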

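Note
"Including but not limited to" means the listed items are examples rather than strict requirements (a bit of a wording game)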

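Three Questions and Answers

  1. Does the coarse depth of MiDaS come from discriminative modeling itself? Is a diffusion-based approach required to obtain fine detail?
    • No. Discriminative models can still produce fine details; the key is to replace all labeled real images with precise synthetic images
  2. If synthetic images are already better than labeled real images, why did earlier training recipes keep using real images?
    • Synthetic images have drawbacks of their own, and these were hard to overcome in earlier learning paradigms
  3. How can the drawbacks of synthetic images be avoided while amplifying their advantages?
    • Synthetic images are used only to train the teacher model, which then pseudo-labels real images at scale to train smaller student models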

Revisiting the Labeled Data Design of Depth Anything V1

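This section points out that earlier work tried to build large collections of labeled real images for training but overlooked the problems of real labels.

Problems with Real Labels

  1. Depth sensors cannot reliably capture the depth of transparent objects (Figure 3a)
  2. Stereo matching algorithms fail on repetitive patterns (Figure 3b)
  3. SfM methods mislabel dynamic objects (Figure 3c)
  4. Object details are too coarse in the predictions of models trained on such labels.
    The model cannot learn depth features at edges and fine structures, which leads to over-smoothed predictions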

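Advantages of Synthetic Images

  1. All fine details (textures, edges, thin holes, small objects) are labeled correctly, because the labels are generated by a graphics engine
  2. The actual depth of transparent objects and reflective surfaces is captured
    Figure 4 compares models trained on real images with models trained on synthetic images; the model trained on synthetic images produces noticeably finer detail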

Challenges in Using Synthetic Data

The previous section covered the benefits of synthetic images, but some problems remain in practice.

Limitation

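  1. A distribution gap exists between synthetic and real images. Even though modern simulators can render photorealistic images, they still do not match real ones: synthetic images are too clean and ordered, so it is hard to transfer from the synthetic data distribution to the real one, and some bias always remains
  2. The scene coverage of synthetic images is limited
    • Synthetic examples: Hypersim (Apple), Virtual KITTI
    • Real datasets with wider coverage: HRWSI (web stereo images), MegaDepth

Experiments show that training a depth estimation model on synthetic data alone is not feasible, even with larger models: a gap in fine-grained detail remains.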

Key Role of Large-Scale Unlabeled Real Images

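Solution: add unlabeled real-world images whose labels are generated by the large model

  • Bridge the domain gap.
    • Because of the domain gap, first train a large model on synthetic datasets, then run real images through it to produce pseudo-labeled real images, and finally train the smaller model on them so that it learns the real-world distribution
  • Enhance the scene coverage.
    • Synthetic data has limited diversity, so more real-world images are needed to fill the gap
    • Feed more public datasets into the large model to produce pseudo-labeled real images
      • That is, widen the effective coverage of the dataset
  • Transfer knowledge from the most capable model to smaller ones.
    • Similar to knowledge distillation, but applied at the data (label) level rather than on output probability distributions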
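
The label-level distillation described above can be sketched as follows. This is only a toy illustration under strong assumptions: `toy_teacher` is a hypothetical stand-in for the large model trained on synthetic data (here it simply averages the color channels), and the "student" is a per-pixel linear model fitted with least squares. None of these names or choices come from the paper.

```python
import numpy as np


def toy_teacher(image: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the large teacher trained on synthetic data.

    Maps an (H, W, 3) image to an (H, W) fake "depth" map by averaging the
    channels; a real teacher would be a large neural network.
    """
    return image.mean(axis=-1)


def pseudo_label(teacher, unlabeled_images):
    """Run the teacher on every unlabeled real image and keep (image, label) pairs."""
    return [(image, teacher(image)) for image in unlabeled_images]


def fit_linear_student(dataset) -> np.ndarray:
    """Fit a tiny per-pixel linear student using only the pseudo-labels."""
    pixels = np.concatenate([image.reshape(-1, 3) for image, _ in dataset])
    labels = np.concatenate([depth.reshape(-1) for _, depth in dataset])
    weights, *_ = np.linalg.lstsq(pixels, labels, rcond=None)
    return weights


rng = np.random.default_rng(0)
# Random arrays stand in for unlabeled real photos.
real_images = [rng.random((4, 6, 3)) for _ in range(3)]
dataset = pseudo_label(toy_teacher, real_images)
student_weights = fit_linear_student(dataset)
print(student_weights)  # the student should approximate the teacher's mapping
```

The point of the sketch is that the student never sees the teacher's internals or its training data; knowledge moves only through the labels the teacher assigns to new real images.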

Depth Anything V2

Framework

  1. Train a large model on a large and effective synthetic dataset
  2. Use the large model to produce precise pseudo-labeled real images
  3. Train student models on the large set of pseudo-labeled real images

In essence, most of the effort goes into the data.

Detail

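  • During training, the regions with the top-$n$ largest loss values are ignored, with $n$ set to 10%; these high-loss regions are treated as potentially noisy pseudo-labels
  • $\mathcal{L}_{gm}$ (the gradient matching loss) contributes greatly to depth sharpness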
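
Both training details can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: the function names are hypothetical, the gradient matching term is single-scale and skips any scale-and-shift alignment, and the per-pixel loss is assumed to be already computed.

```python
import numpy as np


def masked_mean_loss(per_pixel_loss: np.ndarray, ignore_ratio: float = 0.1) -> float:
    """Average the per-pixel loss after dropping the largest `ignore_ratio` fraction."""
    flat = np.sort(per_pixel_loss.ravel())
    keep = max(flat.size - int(flat.size * ignore_ratio), 1)
    return float(flat[:keep].mean())


def gradient_matching_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Simplified single-scale gradient matching on the residual pred - target."""
    residual = pred - target
    grad_x = np.abs(residual[:, 1:] - residual[:, :-1])  # horizontal differences
    grad_y = np.abs(residual[1:, :] - residual[:-1, :])  # vertical differences
    return float(grad_x.mean() + grad_y.mean())


# One outlier (100.0) plays the role of a noisy pseudo-label region.
loss_map = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
print(masked_mean_loss(loss_map, ignore_ratio=0.1))

# A sharp vertical edge in the target versus a flat prediction.
target = np.zeros((4, 4))
target[:, 2:] = 1.0
print(gradient_matching_loss(np.zeros((4, 4)), target))
```

Dropping the highest-loss fraction keeps a few badly pseudo-labeled regions from dominating the update, while the gradient term penalizes predictions whose edges do not line up with the label's edges, which is consistent with the note that it helps sharpness.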