# 3D and Depth
###### tags: `Deep Learning for Computer Vision`
## Applications of 3D vision

## How do we represent the 3D world?
* Multi-view RGB-D images
<img src="https://i.imgur.com/OcGLPot.jpg" width="360"/>
* Voxels
<img src="https://i.imgur.com/EbA6aVj.png" width="320"/>
* Polygonal Mesh
<img src="https://i.imgur.com/tSG3IdC.png" width="320"/>
* Point cloud
<img src="https://i.imgur.com/8x5bctO.png" width="320"/>
## 3D Perception

### Multi-view images: representation
Represent an object with images captured from multiple viewpoints

#### MVCNN

* Extract image features with a shared CNN
* Aggregate features from all views with view pooling
* View pooling is an element-wise max operation across views
* Closely related to max-pooling and max-out layers; the only difference is the dimension over which the max is taken
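The view-pooling step can be sketched as an element-wise max across the view axis (a minimal numpy sketch; the view count and feature dimension here are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical per-view features: V views, each a D-dimensional CNN descriptor
V, D = 12, 512
view_features = np.random.rand(V, D)

# View pooling: element-wise max across the view axis,
# producing a single view-aggregated descriptor of shape (D,)
pooled = view_features.max(axis=0)

assert pooled.shape == (D,)
```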
**Problems**
* Sensitive to the choice of viewpoint set
* Occluded or invisible viewpoints
* No explicit information about geometry
* Assumes static objects captured from multiple viewpoints, which is not always realistic
### Voxels: representation
* Grid at a fixed resolution $x \times y \times z$
* Each cell contains 0/1: occupancy
<img src="https://i.imgur.com/ztDAK6Y.png" width="320"/>
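Building such an occupancy grid from a point set can be sketched as follows (a minimal numpy sketch; the `voxelize` helper and the resolution are illustrative, not from the notes):

```python
import numpy as np

def voxelize(points, resolution=32):
    """Map points in [0, 1)^3 to a binary occupancy grid of shape (res, res, res)."""
    grid = np.zeros((resolution,) * 3, dtype=np.uint8)
    # Scale coordinates to cell indices and clamp to the grid bounds
    idx = np.clip((points * resolution).astype(int), 0, resolution - 1)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1  # mark occupied cells
    return grid

points = np.array([[0.1, 0.2, 0.3], [0.9, 0.9, 0.9]])
grid = voxelize(points, resolution=4)
# the point (0.1, 0.2, 0.3) occupies cell (0, 0, 1); (0.9, 0.9, 0.9) occupies (3, 3, 3)
```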
#### 3D CNN

#### 3D ShapeNets: results

**Problem**
* Low resolution
### Point Cloud: representation
* A point cloud is a set of points representing a 3D shape
* Each point is represented by its coordinates (x, y, z)
* A point cloud is stored as an $N \times 3$ matrix (N: number of points, 3: coordinates)

* Point clouds can be obtained from LiDAR sensors
* They capture scene geometry directly
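The $N \times 3$ storage above is just a matrix of coordinates; a small numpy sketch (the points and the centering step are illustrative):

```python
import numpy as np

# A point cloud with N = 4 points, stored as an N x 3 matrix of (x, y, z) coordinates
cloud = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
assert cloud.shape == (4, 3)

# A common preprocessing step: center the cloud at the origin
centered = cloud - cloud.mean(axis=0)
```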

#### PointNet
**Classification**
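PointNet's core idea for classification is a shared per-point MLP followed by a symmetric max-pooling aggregation, which makes the global feature invariant to point ordering. A minimal numpy sketch (the layer sizes are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(points, W, b):
    """Apply the same linear layer + ReLU to every point: (N, d_in) -> (N, d_out)."""
    return np.maximum(points @ W + b, 0.0)

N, d_out = 128, 64
points = rng.standard_normal((N, 3))   # an N x 3 point cloud
W = rng.standard_normal((3, d_out))
b = np.zeros(d_out)

per_point = shared_mlp(points, W, b)   # per-point features, shape (N, 64)
global_feat = per_point.max(axis=0)    # symmetric max pool -> shape (64,)

# Permutation invariance: reordering the points leaves the global feature unchanged
perm = rng.permutation(N)
assert np.allclose(shared_mlp(points[perm], W, b).max(axis=0), global_feat)
```

The max pool is what makes the network a set function: any permutation of the input rows yields the same global descriptor.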

**Segmentation**

#### Qualitative results

**Problems**
* Outlier/noisy points
* Cannot capture texture
* Not robust to transformations (e.g., scaling, shifting, rotation)
#### VoteNet

* Supervised learning

**Results**

## 3D Reconstruction
### Unsupervised Depth estimation
* Estimate the depth image from a single RGB input image **without supervision**
* Render the **disparity map** from a single image
* Depth info can be estimated from the disparity map
* The disparity map can warp the left image to the right image (and vice versa)
* Training data: **stereo image pairs** required
<img src="https://i.imgur.com/zT232TB.jpg" width="400"/>
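For a rectified stereo pair, depth and disparity are related by $\text{depth} = f B / d$ (focal length $f$ in pixels, baseline $B$ in meters). A minimal numpy sketch with illustrative values (not from the notes):

```python
import numpy as np

focal = 720.0     # focal length in pixels (illustrative)
baseline = 0.54   # stereo baseline in meters (illustrative)

# A tiny 2x2 predicted disparity map, in pixels
disparity = np.array([[90.0, 45.0],
                      [30.0, 15.0]])

depth = focal * baseline / disparity  # depth in meters; larger disparity -> closer
```

So once the network renders a disparity map, per-pixel depth follows from this closed-form relation without any depth labels.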
#### Method

### Voxel: 3D R2N2

### Point Set Generation

### Mesh: representation

#### AtlasNet


#### Mesh R-CNN


## Implicit Representation
* Represent a shape as a “function”
* The function tells us whether a query point is inside or outside the shape; the surface is the decision boundary
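As a toy example (not from the notes), a sphere can be written as a hand-coded occupancy function; an occupancy network replaces this analytic function with a learned neural network:

```python
import numpy as np

def occupancy(p, radius=1.0):
    """Implicit occupancy function of a sphere: 1 inside/on the surface, 0 outside."""
    return (np.linalg.norm(p, axis=-1) <= radius).astype(np.float32)

points = np.array([[0.0, 0.0, 0.0],   # inside
                   [0.0, 0.0, 1.0],   # on the surface
                   [2.0, 0.0, 0.0]])  # outside
occ = occupancy(points)
# occ -> [1., 1., 0.]
```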
### Occupancy Network: method


**Problems**
* No information about texture
* Requires post-processing (e.g., Marching Cubes) to extract a mesh
* Cannot handle complex scenes
### NeRF


**Problems**
* Fits only a single scene per model
* Requires posed input images
* Time-consuming rendering (~30 s per frame)
* No explicit geometry
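The rendering cost above comes from volume rendering: each pixel color is composited from many samples along a ray, $C = \sum_i T_i (1 - e^{-\sigma_i \delta_i}) c_i$ with transmittance $T_i = \prod_{j<i} e^{-\sigma_j \delta_j}$. A minimal numpy sketch of this compositing step for one ray (sample values are illustrative):

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """Alpha-composite per-sample densities/colors along a single ray."""
    alphas = 1.0 - np.exp(-sigmas * deltas)  # per-sample opacity
    # Transmittance T_i: probability the ray reaches sample i unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)  # rendered RGB

sigmas = np.array([0.0, 5.0, 50.0])   # densities at 3 samples along the ray
colors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
deltas = np.full(3, 0.1)              # distances between adjacent samples
rgb = composite(sigmas, colors, deltas)
```

Doing this for hundreds of samples per ray, for every pixel, through a full MLP is what makes naive NeRF rendering slow.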
## Summary
