
[summary] Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images (2018)

[arxiv, papers with code]

Summary

Approach

Pixel2Mesh is an end-to-end network that takes a single 2D RGB image as input and produces a triangular 3D mesh by applying a sequence of refinement stages to an initial ellipsoid mesh.


The initial mesh is an ellipsoid with a fixed number of vertices (156), fixed axis radii (0.2 m, 0.2 m, 0.4 m), and a fixed position relative to the camera (0.8 m in front of it). As a note, it is unclear from the paper whether the camera intrinsics are fixed across training examples.

Throughout the paper, the mesh is treated as a graph: mesh vertices are graph vertices, mesh edges are graph edges, and each vertex carries a feature vector in addition to its coordinates. These feature vectors are used by the graph convolution network described later.
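
To make the graph view concrete, a minimal sketch of the state passed between refinement stages might look as follows (my own illustration, not the paper's code; the feature width and the toy edge list are arbitrary, only the 156 vertices follow the initial ellipsoid):

```python
import numpy as np

# Minimal mesh-as-graph container (illustrative only).
# coords   : (N, 3) vertex positions
# features : (N, F) learned per-vertex feature vectors
# edges    : (E, 2) pairs of vertex indices defining mesh edges
class MeshGraph:
    def __init__(self, coords, features, edges):
        self.coords = np.asarray(coords, dtype=np.float32)
        self.features = np.asarray(features, dtype=np.float32)
        self.edges = np.asarray(edges, dtype=np.int64)

    def neighbours(self, v):
        """Indices of vertices sharing an edge with vertex v."""
        touching = self.edges[(self.edges == v).any(axis=1)]
        return np.unique(touching[touching != v])

# Initial state: 156 ellipsoid vertices with zero-initialised features.
initial = MeshGraph(coords=np.zeros((156, 3)),
                    features=np.zeros((156, 128)),
                    edges=[[0, 1], [1, 2]])  # toy edge list; the real one comes from the ellipsoid triangulation
```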

Network architecture

  • Input image processing is performed by VGG-16 convolutions, which produce feature maps used at the mesh refinement stages. Each refinement stage except the very first one consists of a graph unpooling layer followed by a mesh deformation block.
  • The mesh deformation block consists of two major parts: a perceptual feature pooling layer and a graph convolution network. It takes vertex coordinates $C_{i-1}$ and vertex features $F_{i-1}$ of the mesh as input and produces refined coordinates $C_i$ and features $F_i$.
  • The perceptual feature pooling block projects the mesh vertices $C_{i-1}$ onto the image plane using the known camera intrinsics and matches the projections with the feature maps produced by the 2D convolutions of the input image. Because a projected vertex generally does not land exactly on the feature map grid, its feature is bilinearly interpolated from the nearby cells. Essentially, for each mesh vertex this process extracts a vector, a perceptual feature, from the 2D feature maps (a rough sketch follows after this list).
  • The graph convolution network takes the vertex features $F_{i-1}$ concatenated with the extracted perceptual features, performs a series of graph convolutions, and produces updated vertex coordinates and features.
  • To better match the ground truth, the resolution of the mesh is increased by the graph unpooling layer. For each edge, this layer creates a new vertex at the midpoint of that edge. The newly created vertices are then connected, which splits each triangular face into 4 smaller faces.
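
As referenced above, here is a rough NumPy sketch of the two geometric operations, perceptual feature pooling (pinhole projection plus bilinear interpolation) and edge-based graph unpooling. It is not the authors' implementation; the function names, the pinhole model, and the assumed input resolution are mine:

```python
import numpy as np

def perceptual_feature_pooling(vertices, feature_map, K, img_size=224):
    """Pool one feature vector per mesh vertex from a 2D feature map.

    vertices    : (N, 3) vertex coordinates in the camera frame
    feature_map : (H, W, C) a feature map produced by the image encoder
    K           : (3, 3) camera intrinsics (assumed known, as in the paper)
    img_size    : input image resolution assumed by this sketch
    """
    H, W, C = feature_map.shape
    pooled = np.zeros((len(vertices), C), dtype=feature_map.dtype)
    for i, (x, y, z) in enumerate(vertices):
        # Pinhole projection of the vertex onto the image plane.
        u = K[0, 0] * x / z + K[0, 2]
        v = K[1, 1] * y / z + K[1, 2]
        # Rescale to this feature map's (smaller) resolution and clamp.
        u = np.clip(u / img_size * (W - 1), 0, W - 1)
        v = np.clip(v / img_size * (H - 1), 0, H - 1)
        # Bilinear interpolation of the four surrounding cells.
        u0, v0 = int(np.floor(u)), int(np.floor(v))
        u1, v1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1)
        du, dv = u - u0, v - v0
        pooled[i] = ((1 - du) * (1 - dv) * feature_map[v0, u0]
                     + du * (1 - dv) * feature_map[v0, u1]
                     + (1 - du) * dv * feature_map[v1, u0]
                     + du * dv * feature_map[v1, u1])
    # In the paper, features pooled from several VGG layers are concatenated.
    return pooled


def graph_unpool(vertices, faces):
    """Edge-based unpooling: add a vertex at the midpoint of every edge and
    split each triangular face into 4 smaller ones."""
    vertices = [np.asarray(p, dtype=np.float64) for p in vertices]
    midpoints = {}

    def midpoint(a, b):
        key = (min(a, b), max(a, b))
        if key not in midpoints:
            midpoints[key] = len(vertices)
            vertices.append((vertices[a] + vertices[b]) / 2.0)
        return midpoints[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return np.stack(vertices), new_faces
```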

Losses

To train the network, a combination of four different losses is used: the Chamfer loss, a normal loss, a laplacian regularization, and an edge length regularization. These losses are balanced with some manually selected coefficients.
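
Schematically, the overall training loss is the weighted sum of these four terms, with the weights being the manually selected coefficients:

$$l_{all} = l_c + \lambda_1 l_n + \lambda_2 l_{lap} + \lambda_3 l_{loc}$$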

  • The Chamfer loss measures the distance of each point to the closest point in the other point set, and essentially ensures that the produced mesh more or less matches the ground truth:
    $$l_c = \sum_p \min_q \|p - q\|_2^2 + \sum_q \min_p \|p - q\|_2^2,$$
    where $p$ runs over the predicted mesh vertices and $q$ over the ground truth points.
  • The normal loss forces the edges between a vertex and its neighbours to be perpendicular to the ground truth surface normal observed at the closest corresponding ground truth point:
    $$l_n = \sum_p \sum_{q = \arg\min_q \|p - q\|_2^2} \left\| \langle p - k, \mathbf{n}_q \rangle \right\|_2^2, \quad k \in \mathcal{N}(p),$$
    where $\mathcal{N}(p)$ are the neighbours of $p$ and $\mathbf{n}_q$ is the ground truth surface normal at $q$.
  • To ensure that the mesh deformation is not too aggressive (which might lead to e.g. mesh self-intersection), the laplacian regularization is used. First, a laplacian coordinate is computed for each vertex:
    $$\delta_p = p - \sum_{k \in \mathcal{N}(p)} \frac{1}{\|\mathcal{N}(p)\|} k$$
    After that, the laplacian regularization is defined as an L2 difference of laplacian coordinates before and after a deformation block:
    $$l_{lap} = \sum_p \|\delta'_p - \delta_p\|_2^2$$
  • Finally, to penalize flying vertices, which typically produce overly long edges, the edge length regularization is used:
    $$l_{loc} = \sum_p \sum_{k \in \mathcal{N}(p)} \|p - k\|_2^2$$
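
As a worked example of the main term, a naive quadratic-time version of the Chamfer loss can be computed directly in NumPy (my own sketch, not the authors' implementation):

```python
import numpy as np

def chamfer_loss(pred, gt):
    """Symmetric Chamfer loss between predicted vertices pred (N, 3)
    and ground truth points gt (M, 3), using squared Euclidean distances."""
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((pred[:, None, :] - gt[None, :, :]) ** 2, axis=-1)
    # Nearest ground truth point for every prediction, and vice versa.
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

# Toy usage with random point clouds.
print(chamfer_loss(np.random.rand(156, 3), np.random.rand(1000, 3)))
```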

Experiments

  • Some qualitative results:
    (figure from the paper: qualitative results)
  • Ablation study examples: it is easy to see how the edge length regularization and the normal loss each help to produce smoother shapes:
    (figure from the paper: ablation examples)

Metrics

  • Comparison against 3D-R2N2, PSG, and N3MR using the F-score metric on the ShapeNet dataset. Precision is the fraction of predicted points that have a ground truth point within a certain radius τ, and recall is the same fraction computed in the opposite direction (see the sketch after this list).
    (table from the paper: F-score comparison)
  • Same comparison using the Chamfer distance (CD) and the earth mover's distance (EMD).
    (table from the paper: CD and EMD comparison)
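
A toy version of this F-score, computed with brute-force nearest neighbours (again my own sketch, not the evaluation code used in the paper):

```python
import numpy as np

def f_score(pred, gt, tau):
    """F-score at threshold tau between predicted points pred (N, 3)
    and ground truth points gt (M, 3)."""
    d2 = np.sum((pred[:, None, :] - gt[None, :, :]) ** 2, axis=-1)
    precision = np.mean(np.sqrt(d2.min(axis=1)) < tau)  # pred -> gt coverage
    recall = np.mean(np.sqrt(d2.min(axis=0)) < tau)     # gt -> pred coverage
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```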

Discussion

  • The VGG-16 architecture progressively decreases the spatial resolution of its feature maps in the later stages, which is where the mesh refinement pools its features from. Could switching to a U-Net-like architecture be better at preserving details?
  • Perceptual feature pooling does not take into account that some mesh vertices might be occluded from the camera and therefore should not have corresponding pixels in the 2D feature maps.
  • It seems that the produced 3D mesh is always topologically equivalent to a sphere, so there is no way to handle non-contiguous shapes or shapes with holes.

The diagrams are taken from the paper with minor changes.
All errors present in the interpretation are my own.