# 3D and Depth

###### tags: `Deep Learning for Computer Vision`

## Applications of 3D vision

![](https://i.imgur.com/v8VSyX0.jpg)

## How do we represent the 3D world?

* Multi-view RGB-D images
<img src="https://i.imgur.com/OcGLPot.jpg" width="360"/>
* Voxels
<img src="https://i.imgur.com/EbA6aVj.png" width="320"/>
* Polygonal mesh
<img src="https://i.imgur.com/tSG3IdC.png" width="320"/>
* Point cloud
<img src="https://i.imgur.com/8x5bctO.png" width="320"/>

## 3D Perception

![](https://i.imgur.com/hjy5lbw.png)

### Multi-view images: representation

Represent an object with images captured from multiple viewpoints.

![](https://i.imgur.com/xecOwzW.png)

#### MVCNN

![](https://i.imgur.com/c1OBgat.png)

* Extract features from each view with a shared CNN (CNN1)
* Aggregate features from all views with view pooling
    * Element-wise max operation across views
    * Closely related to max-pooling and max-out layers; the only difference is the dimension along which the max is taken

**Problems**

* Sensitive to the choice of viewpoint set
* Occlusions and invisible viewpoints
* No explicit geometry information
* Assumes a static object captured from multiple viewpoints, which is not always realistic

### Voxels: representation

* A grid of fixed resolution $x \times y \times z$
* Each cell stores a 0/1 occupancy value

<img src="https://i.imgur.com/ztDAK6Y.png" width="320"/>

#### 3D CNN

![](https://i.imgur.com/GJHeLvt.png)

#### 3D ShapeNets: results

![](https://i.imgur.com/XN79g7d.png)

**Problem**

* Low resolution (memory and compute grow cubically with resolution)

### Point cloud: representation

* A point cloud is a set of points representing a 3D shape
* Each point is represented by its coordinates (x, y, z)
* A point cloud is stored as an $N \times 3$ matrix (N: number of points, 3: coordinates)

![](https://i.imgur.com/Dp8GRud.png)

* Point clouds can be obtained from LiDAR sensors
* They capture scene geometry

![](https://i.imgur.com/onXXjsw.jpg)

#### PointNet

A shared per-point MLP followed by a symmetric max-pooling operation, which makes the network invariant to the ordering of input points (see the code sketch below).

**Classification**

![](https://i.imgur.com/idBze9F.png)

**Segmentation**

![](https://i.imgur.com/sUTOZws.png)

#### Qualitative results

![](https://i.imgur.com/64L4uly.jpg)

**Problems**

* Outlier/noisy points
* Cannot capture texture
* Not robust to transformations (e.g., scaling, shifting, rotation)
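To make PointNet's order invariance concrete, here is a minimal PyTorch sketch of the classification branch. It is an illustration under assumptions, not the authors' implementation: the class name `TinyPointNet`, the layer widths, and the number of classes are made up for the example, and the input/feature transform (T-Net) blocks are omitted.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style classifier (illustrative widths, no T-Nets)."""

    def __init__(self, num_classes=10):
        super().__init__()
        # Shared per-point MLP, implemented as 1x1 convolutions over the point axis.
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, pts):                # pts: (B, N, 3) point clouds
        x = pts.transpose(1, 2)            # (B, 3, N) for Conv1d
        x = self.point_mlp(x)              # (B, 1024, N) per-point features
        x, _ = x.max(dim=2)                # symmetric max-pool over the N points
        return self.head(x)                # (B, num_classes) logits

# Toy usage: two clouds of 1024 points each. Shuffling the points along
# dim 1 leaves the logits unchanged, since max over points is symmetric.
logits = TinyPointNet()(torch.randn(2, 1024, 3))
```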
#### VoteNet

![](https://i.imgur.com/qayVyfa.jpg)

* Supervised learning

![](https://i.imgur.com/OR5nRT5.png)

**Results**

![](https://i.imgur.com/tDA4P1F.jpg)

## 3D Reconstruction

### Unsupervised depth estimation

* Estimate the depth image from a single RGB input image **without depth supervision**
* Predict the **disparity map** from a single image
    * Depth can be recovered from disparity via $\text{depth} = fB/d$ (focal length $f$, stereo baseline $B$, disparity $d$)
    * The disparity map can warp the left image into the right image (and vice versa)
* Training data: **stereo image pairs** required

<img src="https://i.imgur.com/zT232TB.jpg" width="400"/>

#### Method

![](https://i.imgur.com/ivNXDQk.png)

### Voxel: 3D-R2N2

![](https://i.imgur.com/N9M6Pgz.png)

### Point Set Generation

![](https://i.imgur.com/UAXZKK4.png)

### Mesh: representation

![](https://i.imgur.com/jGsREvh.png)

#### AtlasNet

![](https://i.imgur.com/izrKyTf.png)
![](https://i.imgur.com/pnrCACm.png)

#### Mesh R-CNN

![](https://i.imgur.com/nPn23aw.jpg)
![](https://i.imgur.com/Z4AxB7j.jpg)

## Implicit Representation

* Represent shapes as a “function”
* The function tells us whether a point lies inside the shape; the surface is the level set of this function

### Occupancy Network: method

![](https://i.imgur.com/1sQ3VIa.png)
![](https://i.imgur.com/S2ZSjqa.png)

**Problems**

* No texture information
* Requires post-processing (e.g., Marching Cubes) to extract a mesh
* Cannot handle complex scenes

### NeRF

![](https://i.imgur.com/tFPf1dH.jpg)
![](https://i.imgur.com/6nU0JlV.png)

NeRF renders each pixel by compositing predicted color and density along the camera ray (see the volume-rendering sketch at the end of these notes).

**Problems**

* Fits only a single scene (one network per scene)
* Requires posed images
* Time-consuming rendering (~30 s per frame)
* No explicit geometry

## Summary

![](https://i.imgur.com/akzDP8r.png)
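Closing worked example for the NeRF section above: each pixel color is composited as $\hat{C} = \sum_i T_i\,(1 - e^{-\sigma_i \delta_i})\,c_i$ with transmittance $T_i = \prod_{j<i}(1 - \alpha_j)$. Below is a minimal sketch of this discrete volume-rendering step; the function name `render_rays` and the toy shapes are illustrative, and the ray sampler and the MLP that predicts `rgb` and `sigma` are omitted.

```python
import torch

def render_rays(rgb, sigma, deltas):
    """Discrete NeRF-style volume rendering for a batch of rays.

    rgb:    (R, S, 3) per-sample colors along each ray
    sigma:  (R, S)    per-sample volume densities
    deltas: (R, S)    distances between adjacent samples
    returns (R, 3)    composited pixel colors
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)      # opacity of each sample
    # Transmittance T_i = prod_{j<i} (1 - alpha_j); prepend ones so T_0 = 1.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
        dim=1,
    )[:, :-1]
    weights = trans * alpha                       # w_i = T_i * alpha_i
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)

# Toy usage: 4 rays, 64 samples per ray, uniform sample spacing of 0.03.
pixels = render_rays(torch.rand(4, 64, 3), torch.rand(4, 64), torch.full((4, 64), 0.03))
```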