# [summary] Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images (2018)
\[[arxiv](https://arxiv.org/abs/1804.01654), [papers with code](https://paperswithcode.com/paper/pixel2mesh-generating-3d-mesh-models-from)\]
## Summary
### Approach
Pixel2Mesh is an end-to-end network that takes a single 2D RGB image as input and produces a triangular 3D mesh by applying a sequence of mesh refinements to an initial ellipsoid mesh.

The **initial mesh** is an ellipsoid with a fixed number of vertices (156), fixed axis radii (0.2m, 0.2m, 0.4m), and a fixed location relative to the camera (0.8m in front of it). As a note, it is unclear from the paper whether the camera intrinsics are fixed across the training examples.
Throughout the paper, the mesh is treated as a graph: mesh vertices are graph vertices, mesh edges are graph edges, and each vertex carries a feature vector in addition to its coordinates. These feature vectors are used by the graph convolutional network described later; a minimal sketch of this representation follows below.
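A minimal NumPy sketch of how this mesh-as-graph state could be stored. The class name, field layout, and feature width are my own assumptions and not the paper's implementation; the zeros stand in for the actual ellipsoid vertices and connectivity, which are produced offline.
```python
import numpy as np

class MeshGraph:
    """Mesh state carried between refinement stages: vertex coordinates,
    per-vertex feature vectors, and edge connectivity."""
    def __init__(self, coords, features, edges):
        self.coords = coords      # (N, 3) vertex positions in camera space
        self.features = features  # (N, F) learned per-vertex feature vectors
        self.edges = edges        # (E, 2) pairs of vertex indices (undirected)

# Initial ellipsoid: 156 vertices, axis radii (0.2, 0.2, 0.4) m, centred
# 0.8 m in front of the camera; placeholder arrays stand in for real data.
N, F = 156, 128  # F = 128 is an assumed feature width, not a paper value
initial_mesh = MeshGraph(
    coords=np.zeros((N, 3), dtype=np.float32),
    features=np.zeros((N, F), dtype=np.float32),
    edges=np.zeros((0, 2), dtype=np.int64),
)
```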
#### Network architecture
- The input image is processed by VGG-16 convolutional layers, which produce feature maps used at the **mesh refinement** stages. Each mesh refinement stage consists of a mesh deformation block; every stage except the very first one is additionally preceded by a mesh/graph unpooling layer that increases the mesh resolution.
- The **mesh deformation block** consists of two major parts: a perceptual feature pooling layer and a graph convolution network. It takes the current vertex coordinates C<sub>i-1</sub> and vertex features F<sub>i-1</sub> as input and produces refined coordinates C<sub>i</sub> and features F<sub>i</sub>.

- The **perceptual feature pooling** block projects the mesh vertices C<sub>i-1</sub> onto the image plane using known camera intrinsics and matches the projected locations with the feature maps produced by the 2D convolutions of the input image. Because the projected locations generally fall between cells of the feature maps, bilinear interpolation over the four nearest cells is used. Essentially, for each mesh vertex this process extracts a vector (a perceptual feature) from the 2D feature maps; a rough sketch of this and the two following operations is given after the figure below.
- The **graph convolution network** takes the vertex features F<sub>i-1</sub> concatenated with the extracted perceptual features, performs a series of graph convolutions, and produces updated vertex coordinates and features.
- To better match the ground truth, the resolution of the mesh is increased by the **graph unpooling layer**. For each edge, this layer creates a new vertex at the edge's midpoint. The newly created vertices are then connected to each other, which splits each triangular face into four smaller faces.
<img src="https://i.imgur.com/iliQYM9.png" alt="graph unpooling layer" style="width:600px;"/>
#### Losses
To train the network, a weighted combination of four losses is used: the Chamfer loss, a normal loss, a Laplacian regularization, and an edge length regularization. The weights are manually selected coefficients. A rough sketch of the four terms follows this list.
- The **Chamfer loss** measures the distance from each point to the closest point in the other point set (in both directions) and essentially ensures that the produced mesh roughly matches the ground truth:
<img src="https://i.imgur.com/pC0ad2N.png" alt="Chamfer loss" style="width:350px; margin-bottom: 10px;"/>
- The **normal loss** forces the edges incident to a vertex to be perpendicular to the surface normal at the closest ground truth point:
<img src="https://i.imgur.com/n4QS5rw.png" alt="normal loss" style="width:500px; margin-bottom: 10px;"/>
- To ensure that the mesh deformation is not too aggressive (which could lead to, e.g., mesh self-intersections), the **Laplacian regularization** is used. First, a Laplacian coordinate is computed for each vertex:
<img src="https://i.imgur.com/hjtDq8K.png" alt="Laplacian coordinate" style="width:230px; margin-bottom: 10px;"/>
After that, the Laplacian regularization is defined as the L2 difference between the Laplacian coordinates before and after a deformation block:
<img src="https://i.imgur.com/zYnGvgQ.png" alt="Laplacian regularization" style="width:110px; margin-bottom: 10px;"/>
- Finally, to penalize flying vertices, which produce overly long edges, the **edge length regularization** is used:
<img src="https://i.imgur.com/G8d6UDp.png" alt="edge length regularization" style="width:250px; margin-bottom: 10px;"/>
### Experiments
- Some qualitative results:
<img src="https://i.imgur.com/QCQdJ7y.png" alt="qualitative results" style="width:650px; margin-bottom: 20px;"/>
- Ablation study examples: it is easy to see how the edge length regularization and the normal loss each help to produce smoother shapes:
<img src="https://i.imgur.com/CzRF3P3.png" alt="ablation study examples" style="width:650px; margin-bottom: 20px;"/>
### Metrics
- Comparison against [3D-R2N2](https://arxiv.org/abs/1604.00449), [PSG](https://arxiv.org/abs/1612.00603), and [N3MR](https://arxiv.org/abs/1711.07566) using the F-score metric on the [ShapeNet](https://shapenet.org/) dataset. Precision is the fraction of predicted points that have a ground truth point within a threshold τ, recall is the converse, and the F-score is their harmonic mean (a sketch of this metric follows this list).
<img src="https://i.imgur.com/GjBKchE.png" alt="F-score comparison" style="width:650px; margin-top: 10px; margin-bottom: 15px;"/>
- Same comparison using the Chamfer distance (CD) and the [earth mover's distance](https://en.wikipedia.org/wiki/Earth_mover%27s_distance) (EMD).
<img src="https://i.imgur.com/zOZVQMO.png" alt="CD/EMD comparison" style="width:650px; margin-top: 10px; margin-bottom: 15px;"/>
## Discussion
- The VGG-16 backbone progressively decreases the spatial resolution of its feature maps, so the later refinement stages, where fine details matter most, pool from coarse maps. Could switching to a UNet-like architecture better preserve the details?
- Perceptual feature pooling does not take into account that some mesh vertices may be occluded from the camera and therefore should not be matched with pixels in the 2D feature maps.
- It seems that the produced 3D mesh is always homeomorphic to a sphere, so there is no way to handle shapes with holes or multiple disconnected components.
<br/>
<small>The diagrams are taken from the paper with minor changes.<br/>All errors in the interpretation are my own.</small>