# Single-view Surface Reconstruction from 2D images with PIFu

### Nguyễn Duy Tiến SE172942, Mai Tuấn Hưng SE172877

## 1. Abstract

* Traditional methods for 3D reconstruction often require multiple images from different viewpoints, or depth sensors, to create a 3D model.

![](https://hackmd.io/_uploads/Hk4QXK7a2.png)

* But now, neural networks have changed the game: a 3D model can be reconstructed from just a single image.

![](https://hackmd.io/_uploads/r1m5XFman.png)

## 2. Introduction

### 3D data representation methods

![](https://hackmd.io/_uploads/BJyMgtXTn.png)

3D representations of the Stanford bunny: (left) point cloud, (middle) voxels, (right) mesh.

* **Point cloud representation**: a set of 3D vertices with coordinates (x, y, z). This is a very common representation, as it is the one usually captured by 3D scanners.
* **Voxel representation**: describes a three-dimensional space using small cube-shaped units called "voxels." Just as a digital image is made up of pixels, a 3D space can be divided into voxels, each storing information about the presence, absence, or properties of objects in that region of space. Minecraft, for example, uses a voxel-based representation for its world-building and gameplay.
* **Mesh representation**: a set of vertices connected to each other to form triangles (or sometimes quadrilaterals). The set of triangles forms a 2D surface embedded in 3D space.
* **Signed distance function (SDF)**: a signed distance function F takes a 3D point P = (x, y, z) as input and outputs the distance from P to the closest point on the 3D surface. F(x, y, z) is zero on the surface, negative inside the surface, and positive outside of it, so F encodes the position of the surface in 3D space. Note that a signed distance function can be converted into a mesh representation with the marching cubes algorithm. A small sketch of the sign convention follows below.

![](https://hackmd.io/_uploads/S1qufFQT2.png)
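To make the sign convention concrete, here is a minimal NumPy sketch of the SDF of a sphere (the sphere, radius, and function names are our own illustration, not part of PIFu):

```python
import numpy as np

def sphere_sdf(p, center=np.zeros(3), radius=1.0):
    """Signed distance to a sphere: zero on the surface,
    negative inside, positive outside (the convention above)."""
    return np.linalg.norm(p - center, axis=-1) - radius

# A point halfway to the surface is inside (negative distance) ...
print(sphere_sdf(np.array([0.5, 0.0, 0.0])))  # -0.5
# ... and a point beyond the surface is outside (positive distance).
print(sphere_sdf(np.array([2.0, 0.0, 0.0])))  #  1.0
```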
### What is PIFu?

- PIFu stands for Pixel-aligned Implicit Function.
- Objective: reconstruct the 3D shape of an object or scene from 2D images.
- It can infer both the 3D surface and its texture from a single image.

### The advantages of PIFu:

* **Memory efficient**: PIFu is memory efficient compared to some other 3D reconstruction methods because it does not need to store large voxel grids or point clouds to represent 3D shapes. Instead, it represents the 3D surface implicitly with a neural network.

![](https://hackmd.io/_uploads/HkPtSt7p3.png)

* **Spatial alignment**: spatial alignment in PIFu refers to aligning the 2D image with the 3D reconstruction so that the pixel information from the image accurately maps to the matching points on the reconstructed 3D surface.

![](https://hackmd.io/_uploads/B1TLIFmTh.png)

## 3. Method

![](https://hackmd.io/_uploads/S1WXDK7an.png)

* There are three main processes in this digitization pipeline:

### Training pipeline:

* **Surface Reconstruction:**
    - Step 1: The 2D image of the object is fed into a fully convolutional image encoder, typically a neural network composed of convolutional layers that perform feature extraction.
    - Step 2: The convolutional layers process the image through a series of convolutions, pooling operations, and activation functions. These operations progressively extract and emphasize different features of the image, such as edges, textures, and object parts.
    - Step 3: As the image passes through the convolutional layers, feature maps are generated at different levels of abstraction. These feature maps contain encoded information about the image's content.
    - Step 4: The implicit function predictor uses these encoded features to predict the implicit function values that determine the presence of the surface at 3D points corresponding to the image pixels.
    - The implicit function is $f(F(x), z(X)) = s$, where $f$ is the implicit function, $F(x)$ is the image feature at pixel $x$ (the 2D projection of the 3D point $X$), and $z(X)$ is the depth of $X$ along the camera axis. From these inputs, $f$ predicts the value $s$, which tells us whether the 3D point lies inside or outside the surface. A minimal sketch of this pixel-aligned query is given at the end of this section.
    - We use MSE as the loss function, with the ground-truth surface defined as the 0.5 level set of a continuous 3D occupancy field:

$$f_v^*(X) = \begin{cases} 1, & \text{if } X \text{ is inside the mesh surface} \\ 0, & \text{otherwise} \end{cases}$$

We train a pixel-aligned implicit function (PIFu) $f_v$ by minimizing the average of the mean squared error:

![](https://hackmd.io/_uploads/Hkog2qXp3.png)

![](https://hackmd.io/_uploads/HkPbiKXTn.png)

* **Texture Inference:**
    - Texture inference predicts the color or texture information for the reconstructed 3D surface based on the 2D image. This step aims to create visually accurate, textured 3D models from the reconstructed surfaces.
    - Step 1: The 2D image of the object is fed into the image encoder for texture inference, together with the image features learned for surface reconstruction.
    - Step 2: The features are then fed into Tex-PIFu to produce color values that correspond to the colors in the input 2D image. This texture information can be applied to the 3D geometry to create a visually realistic representation of the object.
    - Training uses an L1 loss:

![](https://hackmd.io/_uploads/HJ-kpc76n.png)

![](https://hackmd.io/_uploads/SyBzkcm6h.png)

### Testing pipeline:

* The 2D image goes through PIFu to create a 3D occupancy field; the marching cubes algorithm then reconstructs the geometry; finally, Tex-PIFu generates the textured reconstruction (see the marching cubes sketch at the end of this section).

![](https://hackmd.io/_uploads/SkMPfcXa3.png)

* Marching cubes algorithm:
    - https://www.youtube.com/watch?v=M3iI2l0ltbE
    - https://www.cs.carleton.edu/cs_comps/0405/shape/marching_cubes.html

### Implementation:

* Since the PIFu framework is not limited to a specific network architecture, technically any fully convolutional neural network can be used as the image encoder. For surface reconstruction, the authors found that stacked hourglass architectures are effective and generalize better to real images. The image encoder for texture inference adopts the CycleGAN architecture, which consists of residual blocks. The implicit function is a multi-layer perceptron whose layers have skip connections from the image feature $F(x)$ and the depth $z$, in order to effectively propagate the depth information. Tex-PIFu takes $F_C(x)$ together with the image feature for surface reconstruction, $F_V(x)$, as input.
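As a concrete illustration of the pixel-aligned query $f(F(x), z(X)) = s$ described above, here is a minimal PyTorch sketch; the layer sizes, names, and the small MLP are our own simplification, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitFunction(nn.Module):
    """Toy MLP for f(F(x), z(X)) = s, with a skip connection that
    re-injects the pixel-aligned feature and depth (dims illustrative)."""

    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        in_dim = feat_dim + 1  # pixel-aligned feature + depth z(X)
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden + in_dim, hidden)  # skip connection
        self.fc3 = nn.Linear(hidden, 1)

    def forward(self, feat, z):
        x0 = torch.cat([feat, z], dim=-1)
        h = F.relu(self.fc1(x0))
        h = F.relu(self.fc2(torch.cat([h, x0], dim=-1)))
        return torch.sigmoid(self.fc3(h))  # occupancy s in [0, 1]

def query(feature_map, points_xy, points_z, net):
    """Sample the encoder's feature map at the 2D projections x of 3D
    points X, then predict occupancy from (F(x), z(X)).
    feature_map: (B, C, H, W); points_xy: (B, N, 2) in [-1, 1];
    points_z: (B, N, 1)."""
    grid = points_xy.unsqueeze(2)                                 # (B, N, 1, 2)
    feat = F.grid_sample(feature_map, grid, align_corners=True)   # (B, C, N, 1)
    feat = feat.squeeze(-1).permute(0, 2, 1)                      # (B, N, C)
    return net(feat, points_z)                                    # (B, N, 1)

# Example usage with dummy data:
enc_feat = torch.randn(1, 256, 128, 128)   # encoder output F
pts_xy = torch.rand(1, 1000, 2) * 2 - 1    # projected 2D coordinates
pts_z = torch.rand(1, 1000, 1)             # depth values z(X)
occ = query(enc_feat, pts_xy, pts_z, ImplicitFunction())  # (1, 1000, 1)
```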
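And here is a minimal sketch of the mesh-extraction step from the testing pipeline, using scikit-image's marching cubes on a stand-in occupancy field (in PIFu the grid would instead be filled by querying the trained $f_v$ at every grid point):

```python
import numpy as np
from skimage import measure

# Evaluate an occupancy field on a dense grid over the bounding volume.
# Stand-in field: a sphere of radius 0.75 inside [-1, 1]^3.
res = 64
xs = np.linspace(-1.0, 1.0, res)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
occupancy = (np.sqrt(X**2 + Y**2 + Z**2) < 0.75).astype(np.float32)

# Extract the 0.5 level set as a triangle mesh (vertices and faces).
verts, faces, normals, values = measure.marching_cubes(occupancy, level=0.5)
print(verts.shape, faces.shape)  # (N, 3) vertices, (M, 3) triangle indices
```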
## 4. Results and Discussion

### Discussion:

* PIFu has an upgraded successor, PIFuHD, which produces higher-resolution reconstructions and introduces a new pipeline; we will explore it in a future session.

![](https://hackmd.io/_uploads/BkmzJjXp2.png)

* There is also a state-of-the-art method from Google called PHORHUM, which produces some of the best results:

![](https://hackmd.io/_uploads/HyyJesm63.png)

### Results:

![](https://hackmd.io/_uploads/BkwKljQ6h.png)

![](https://hackmd.io/_uploads/rJrvmoXT3.png)

## 5. Conclusion

* While the texture predictions are reasonable and not limited by the topology or parameterization of the inferred 3D surface, we believe that higher-resolution appearances can be inferred, possibly by using generative adversarial networks or by increasing the input image resolution.

## 6. Acknowledgments

* Demo Code: [Demo of PIFuHD](https://colab.research.google.com/drive/1EJNJDw1bOSOVUYfX9tPrqHb9vUMmPXAj?usp=sharing)
* Dataset: [Deepfashion](https://paperswithcode.com/dataset/deepfashion), [BUFF](https://paperswithcode.com/dataset/buff), [CAPE](https://paperswithcode.com/dataset/cape)
* Github: [link](https://github.com/shunsukesaito/PIFu)