# MeshLoc: Mesh-Based Visual Localization
- Paper: https://arxiv.org/pdf/2207.10762.pdf
- References
    - About Visual Localization: https://ithelp.ithome.com.tw/users/20121127/ironman/2900
    - Paper summary 1: https://zhuanlan.zhihu.com/p/579064315
- Abstract
    > Visual localization, i.e., the problem of camera pose estimation, is a central component of applications such as autonomous robots and augmented reality systems. A dominant approach in the literature, shown to scale to large scenes and to handle complex illumination and seasonal changes, is based on local features extracted from images. The scene representation is a sparse Structure-from-Motion point cloud that is tied to a specific local feature. Switching to another feature type requires an expensive feature matching step between the database images used to construct the point cloud. In this work, we thus explore a more flexible alternative based on dense 3D meshes that does not require features matching between database images to build the scene representation. We show that this approach can achieve state-of-the-art results. We further show that surprisingly competitive results can be obtained when extracting features on renderings of these meshes, without any neural rendering stage, and even when rendering raw scene geometry without color or texture. Our results show that dense 3D model-based representations are a promising alternative to existing representations and point to interesting and challenging directions for future research.
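## Introduction
- Visual localization estimates the position and orientation of a camera from image data. It is highly useful for self-driving cars, autonomous robots, AR, and VR
- Current methods rely on 2D-3D correspondences that are established for one specific scene, which is represented (stored) either as a 3D model or as an ML model
    - In a sparse point cloud, every 3D point has been triangulated from local features found in two or more database images
    - To enable matching between a 2D image and the 3D model, each 3D point in the sparse point cloud has to be associated with its corresponding local features
    - This works well but is extremely inflexible
        - When a better local feature comes along, the sparse point cloud has to be recomputed
        - Because the camera parameters and poses are already stored in the database, it is enough to re-triangulate, but that is still very time-consuming
- What dense 3D models offer
    - Dense 3D models can usually be obtained from sources such as depth images or LiDAR scans
    - A dense 3D model is usually more flexible than an SfM model
    - There is no need to triangulate 3D scene points from local features in the database images: the corresponding 3D point can be read directly from a depth map rendered from the dense 3D model
    - Thanks to years of progress in computer graphics, even large 3D models can be rendered within milliseconds, so rendering and feature matching can be done on the fly, without extracting or storing local features in advance
        - This means we can choose between storing images and rendering views of the model in real time
        - For real-time rendering, the level of detail to render at is a question worth considering
- This paper implements feature-based visual localization with a dense 3D model in place of the SfM point cloud
    - It studies how to design a localization pipeline around a dense 3D model and compares it with hierarchical localization systems
    - Given 2D images and a 3D model that accurately aligns with them, even the simplest dense-3D-model localization pipeline already performs very well
    - Compared with SfM point cloud methods, the mesh-based approach makes it cheaper to test new local features and feature matchers for visual localization
    - Without any fine-tuning or re-training, standard local features work surprisingly well on renderings of raw 3D scene geometry (no color or texture). This suggests that standard local features can match real images against purely geometric 3D models, so visual localization could be run directly against models produced by devices such as LiDAR scanners
    - Code and data are open source at https://github.com/tsattler/meshloc_release
- Related work
    - The state-of-the-art algorithms in visual localization are based on local features
        - They usually represent the scene as a sparse SfM point cloud
        - Each 3D point has been triangulated from features in the corresponding database images
        - At test time, descriptor matching establishes 2D-3D matches between the 3D model and the query image
        - To scale to larger scenes and to handle complex illumination and seasonal changes, a hybrid approach is common:
            - An image retrieval stage identifies a small set of relevant database images
            - Descriptor matching is then restricted to the 3D points visible in those images
        - The paper shows that a mesh-based scene representation achieves similar results while making experiments with new features easier
    - An alternative to representing the 3D scene geometry explicitly as a 3D model is to store the scene information in the weights of an ML model
        - For example, scene coordinate regression
            - The 2D-3D matches are regressed rather than found by descriptor matching
            - It currently gives the best results in small scenes, but the results in larger scenes are unsatisfactory
        - Absolute and relative poses can also be regressed
            - Even when trained with additional images obtained through view synthesis, these methods still fall short of feature-based methods
    - Prior work has already rendered novel views from dense Multi-View Stereo point clouds, laser point clouds, and textured/colored meshes, in order to localize images taken from viewpoints that differ strongly from the database images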
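    - Synthetic views of the scene rendered from an estimated pose can also be used for pose verification
    - There are two families of rendering techniques
        - Neural rendering
            - Neural Radiance Fields (NeRFs)
            - image-to-image translation
        - Classical rendering
    - Prior work that uses meshes for localization
        - Given a fairly accurate pose prior $\rightarrow$ render the scene from the estimated pose $\rightarrow$ match features between the real image and the rendering $\rightarrow$ obtain a set of 2D-3D matches used to refine the pose
        - Owing to where the datasets come from, these methods work better in urban scenes and less well in mountainous terrain
        - Since an SfM point cloud cannot be computed from a sparse set of database images, a textured digital elevation model is used to represent the scene instead
        - In contrast to such coarse localization (100 km or more), the mesh-based approach reaches centimeter-level accuracy

## Feature-based Localization via SfM Models
- hierarchical structure-based localization pipeline
    1. Image retrieval: a nearest neighbor search (NNS) over image-level descriptors finds the database images that are relevant to the query
    2. 2D-2D Feature Matching
        - Establish feature matches between the query image and the top-k retrieved database images. These matches are later upgraded to 2D-3D matches
        - State-of-the-art learned local features are typically used
        - Matching is usually exhaustive feature matching
            - optionally followed by outlier filters such as Lowe's ratio test
        - Learned matching strategies are an alternative
        - There are two ways to handle database features
            - Precompute the features of the database images
                - needs more storage
                - lower runtime overhead
            - Store the images and extract features on the fly
                - needs less storage
                - higher runtime overhead
            - Example: Aachen Day-Night v1.1 dataset
                - Storing SuperPoint features takes more than 25 GB
                - The images take only 7.5 GB (2.5 GB if the resolution is reduced to at most 800 px)
    3. 2D-2D match $\rightarrow$ 2D-3D match
        - For the $i$-th 3D scene point $p_i \in \mathbb{R}^3$, the SfM point cloud stores a set of (image, feature) pairs $\{({I_i}_1, \ {f_i}_1) \ \dots \ ({I_i}_n, \ {f_i}_n)\}$
        - $({I_i}_j, \ {f_i}_j)$ means that feature ${f_i}_j$ in image ${I_i}_j$ was used to triangulate the 3D point $p_i$
        - If a query feature matches feature ${f_i}_j$ in image ${I_i}_j$, then it also matches $p_i$. In other words, 2D-3D matches are obtained by looking up the 3D points that correspond to the matched database features
    4. Pose estimation
        - The 2D-3D matches from the previous step are passed to the LO-RANSAC algorithm (locally optimized RANSAC) for accurate pose estimation
        - In each iteration, a P3P solver generates pose hypotheses from a minimal set of three 2D-3D matches
        - Non-linear refinement over all inliers, applied both inside LO-RANSAC and after it, optimizes the pose
- Covisibility filtering
    - Not all matched 3D points are necessarily visible together
    - A covisibility filter is commonly used to handle this:
        - The SfM reconstruction defines the so-called visibility graph $G \ = \ ((I, \ P), \ E)$
            - It is a bipartite graph
            - One set of nodes $I$ corresponds to the database images
            - One set of nodes $P$ corresponds to the 3D points
            - $G$ contains an edge between an image node and a point node if the 3D point has a corresponding feature in that image
        - A set of 2D-3D matches $M = \{(f_i, p_i)\}$ defines a subgraph $G(M)$ of $G$
        - Each connected component of $G(M)$ contains 3D points that are potentially visible together
        - Pose estimation is therefore run per connected component rather than over all matches

## Feature-based Localization without SfM Models
- hierarchical localization pipeline using dense scene representations
    1. Image retrieval
        - Essentially the same as image retrieval in hierarchical structure-based localization
        - Alongside retrieval, the 2D-3D matches are taken from the dense representation
    2. 2D-2D Feature Matching
        - Essentially the same as 2D-2D feature matching in hierarchical structure-based localization
    3. 2D-2D match $\rightarrow$ 2D-3D match
        - When using a depth map obtained from the dense model, every database feature ${f_i}_j$ with a valid depth has its own corresponding 3D point ${p_i}_j$
        - Each 3D point projects exactly onto its feature, so any noise in the database feature is propagated to the 3D point
            - Even though ${f_i}_1, \ \dots \ , {f_i}_n$ are all noisy measurements of the same physical 3D point, the corresponding model points ${p_i}_1, \ \dots \ , {p_i}_n$ may differ slightly
        - If a query feature $q$ matches features ${f_i}_j$ and ${f_i}_k$, we get several different 2D-3D matches $(q,{p_i}_j)$ and $(q,{p_i}_k)$
        - There are two ways to handle multiple matches
            - Use all matches individually
                - easy to implement
                - produces a large number of matches, which slows down the subsequent RANSAC-based pose estimation
                - biases pose estimation towards poses that agree with the features that produce more matches
            - Merge the multiple 2D-3D matches into one
                - From the set of 2D-3D matches $M(q) \ = \ \{(q,p_i)\}$ of a query feature $q$, estimate a single 3D point $p$, which yields one 2D-3D match $(q,p)$
                - $M(q)$ may contain wrong matches, so the following steps average out the noise in the database features and improve the accuracy of the 3D point
                    - Try to find a consensus set using the database features $\{f_i\}$ that correspond to the matching points
                    - For each matching 3D point $p_i$
                        - measure the reprojection error with respect to the database features, and count the features whose error is within a given threshold
                    - Refine the point with the largest number of inliers by minimizing the sum of squared reprojection errors
                    - If no point $p_i$ has at least two inliers, keep all matches in $M(q)$
    4. Pose estimation
        - Same as pose estimation in hierarchical structure-based localization
        - Covisibility filtering needs a small adaptation, however
        - A position averaging step is added as post-processing after RANSAC-based pose estimation
- Covisibility filtering
    - Dense scene representations do not directly provide the co-visibility relations recorded in the visibility graph $G$
    - One solution is to compute visibility from the depth maps
        - but this is expensive
    - A more efficient solution is to build the visibility graph on the fly from matches that share a query feature
        - If there is at least one pair of matches $(q,f_i)$ and $(q,f_j)$ between a query feature $q$ and features $f_i \in I_i$ and $f_j \in I_j$, the 3D points visible in $I_i$ and $I_j$ are considered co-visible
        - The 2D-2D matches (and the corresponding 2D-3D matches) then define a set of connected components, and pose estimation can be run on each component
    - The co-visibility relations generated on the fly are only an approximation of the visibility relations encoded in $G$
        - Images $I_i$ and $I_j$ may share no 3D point and yet both observe 3D points that are also seen in another image $I_k$
        - In the graph $G(M)$, the 2D-3D matches found for $I_i$ and $I_j$ therefore belong to a single connected component
        - In the on-the-fly approximation, this connection can be lost if $I_k$ is not among the top retrieved images
        - On-the-fly covisibility filtering can thus be too aggressive, which over-segments the set of matches and lowers localization performance
- Position averaging
    - The output of pose estimation is the camera rotation $R$, the camera position $c$, and the 2D-3D matches that are inliers to this pose
        - $R \in {\mathbb{R}}^{3\times3}$ is the rotation from global model coordinates to camera coordinates
        - $c \in {\mathbb{R}}^3$ is the position of the camera in global coordinates
    - Empirically, the estimated rotation is often more accurate than the estimated position
    - A simple scheme is therefore used to refine the position $c$
        - Center a volume with side length $2 \times d_{vol}$ around the position $c$
        - Inside the volume, regularly sample new positions with step size $d_{step}$ in each direction
        - For each such position $c_i$, count the number of inliers $I_i$ (here $I_i$ is an inlier count, not an image)
        - Compute the new position estimate $c'$
          $$c'=\frac{1}{\sum_{i}I_i}\sum_{i}I_i \cdot c_i$$
    - This method does not depend on the dense scene representation, so it should be generally applicable

## Experimental Evaluation

## Conclusion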
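
As a supplement to the Position averaging step described in "Feature-based Localization without SfM Models", here is a minimal Python sketch of the weighted average $c'$. It is an illustration under assumptions, not the authors' implementation: it assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)` and counts a match as an inlier when its reprojection error is at most `thresh_px`. The function names `project`, `count_inliers` and `average_position`, as well as all numbers in the toy scene, are made up for this note.

```python
import itertools
import math


def project(R, c, K, X):
    """Project world point X into the image; return None if X is behind the camera.

    R: 3x3 rotation (world -> camera), c: camera center in world coordinates,
    K: assumed pinhole intrinsics (fx, fy, cx, cy).
    """
    d = [X[k] - c[k] for k in range(3)]
    x, y, z = (sum(R[r][k] * d[k] for k in range(3)) for r in range(3))
    if z <= 1e-9:
        return None
    fx, fy, cx, cy = K
    return (fx * x / z + cx, fy * y / z + cy)


def count_inliers(R, c, K, matches, thresh_px):
    """Count 2D-3D matches whose reprojection error is within thresh_px (assumed criterion)."""
    n = 0
    for (u, v), X in matches:
        p = project(R, c, K, X)
        if p is not None and math.hypot(p[0] - u, p[1] - v) <= thresh_px:
            n += 1
    return n


def average_position(R, c, K, matches, d_vol, d_step, thresh_px):
    """Inlier-weighted mean of positions sampled in a cube of side 2 * d_vol around c."""
    steps = int(round(d_vol / d_step))
    offsets = [k * d_step for k in range(-steps, steps + 1)]
    total = 0
    acc = [0.0, 0.0, 0.0]
    for dx, dy, dz in itertools.product(offsets, repeat=3):
        ci = (c[0] + dx, c[1] + dy, c[2] + dz)
        w = count_inliers(R, ci, K, matches, thresh_px)  # I_i in the formula
        total += w
        for k in range(3):
            acc[k] += w * ci[k]
    if total == 0:
        return tuple(c)  # no sample has inliers: keep the original estimate
    return tuple(a / total for a in acc)


# Toy scene (hypothetical values): identity rotation, true camera center at the origin.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (500.0, 500.0, 320.0, 240.0)
c_true = (0.0, 0.0, 0.0)
points = [(1.5, 1.0, 5.0), (-1.5, 1.0, 5.0), (1.5, -1.0, 6.0),
          (-1.5, -1.0, 4.0), (1.0, -1.2, 5.5), (-1.0, 1.2, 4.5)]
matches = [(project(R, c_true, K, X), X) for X in points]

c_est = (0.1, 0.0, 0.0)  # position estimate that is off by 10 cm along x
refined = average_position(R, c_est, K, matches, d_vol=0.2, d_step=0.1, thresh_px=1.0)
print("refined position:", refined)
```

With noisy matches, several neighbouring samples usually keep some inliers, so the weighted mean falls between grid points. The rotation `R` is held fixed throughout, which matches the observation that the estimated rotation tends to be more accurate than the estimated position.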