# [summary] Occupancy Networks: Learning 3D Reconstruction in Function Space (2019) \[[arxiv](https://arxiv.org/abs/1812.03828), [papers with code](https://paperswithcode.com/paper/occupancy-networks-learning-3d-reconstruction), [supplement](http://www.cvlibs.net/publications/Mescheder2019CVPR_supplementary.pdf)\]

## Summary

### Approach

Instead of reconstructing a 3D shape from the input data as a (discrete) voxel grid, point cloud, or mesh, occupancy networks learn a function that predicts the occupancy probability of any continuous 3D point in R<sup>3</sup>:

<img src="https://i.imgur.com/BqKo5zX.png" alt="occupancy function" style="width:120px; margin-top:2px;"/>

The trick is the following: a function that takes an observation **x** as input and returns a function mapping a 3D point **p** to an occupancy probability can be replaced by a function that takes the pair (**p**, **x**) and returns the occupancy probability directly (see also: [uncurrying](https://en.wikipedia.org/wiki/Currying)):

<img src="https://i.imgur.com/aUvzczL.png" alt="occupancy function uncurried" style="width:160px; margin-top:3px;"/>

#### Limitations of existing 3D data representations

- **Voxel grids:** memory-heavy, limited to roughly 128<sup>3</sup>–256<sup>3</sup> resolution
- **Point clouds:** require post-processing to generate a (mesh) surface
- **Meshes:** existing approaches often require additional regularization, can only generate meshes with simple topology, need a reference template from the same object class, or cannot guarantee closed surfaces

#### Pipeline

- **Training:** the network f<sub>θ</sub>(**p**, **x**) takes as input the task-specific observation (for single-image 3D reconstruction this would be an RGB image) and a batch of 3D points randomly sampled from the ground-truth 3D representation. For each 3D point the network predicts an occupancy probability, which is compared with the ground-truth occupancy to compute the mini-batch loss (a minimal code sketch of this objective follows right after this list):

  <img src="https://i.imgur.com/dssY7pd.png" alt="mini-batch loss" style="width:280px; margin-top:0px;"/>

- **Inference:** to extract a 3D mesh from the learned f<sub>θ</sub>(**p**, **x**), the paper uses the *Multiresolution IsoSurface Extraction (MISE)* algorithm, which first builds an octree by progressively subdividing only those grid cells whose corner points receive differing occupancy predictions. The [Marching Cubes algorithm](https://en.wikipedia.org/wiki/Marching_cubes) is then applied to extract the mesh surface. Finally, the mesh is simplified with the [Fast-Quadric-Mesh-Simplification algorithm](https://dl.acm.org/doi/10.5555/288216.288280) and further refined using gradient information from the network.

  <img src="https://i.imgur.com/4Bmj4j6.png" alt="multiresolution isosurface extraction" style="width:410px; margin-top:0px;"/>
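Below is a minimal PyTorch sketch of the training objective described above. `OccupancyNetwork`, its layer sizes, and the random stand-in data are my own simplifications, not the authors' implementation; the real decoder is the conditional ResNet described in the architecture section.

```python
# Minimal sketch of the occupancy-network training objective (my own
# simplification): f_theta(p, x) maps sampled 3D points p and an observation
# embedding to occupancy logits, trained with binary cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F


class OccupancyNetwork(nn.Module):
    """Toy stand-in for f_theta(p, x); the real decoder uses ResNet blocks
    with Conditional Batch Normalization (see the architecture section)."""

    def __init__(self, c_dim=256, hidden=128):
        super().__init__()
        self.fc_p = nn.Linear(3, hidden)
        self.fc_c = nn.Linear(c_dim, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, p, c):
        # p: (B, T, 3) sampled 3D points, c: (B, c_dim) observation embedding
        h = self.fc_p(p) + self.fc_c(c).unsqueeze(1)
        return self.out(F.relu(h)).squeeze(-1)  # (B, T) occupancy logits


def minibatch_loss(model, points, occ_gt, c):
    """Cross-entropy between predicted and ground-truth occupancy,
    averaged over the objects and the T sampled points of the mini-batch."""
    logits = model(points, c)
    return F.binary_cross_entropy_with_logits(logits, occ_gt)


# Usage with random stand-in data (8 objects, T = 2048 points each):
model = OccupancyNetwork()
points = torch.rand(8, 2048, 3) - 0.5            # points in the unit cube
occ_gt = torch.randint(0, 2, (8, 2048)).float()  # ground-truth occupancies
c = torch.randn(8, 256)                          # task-specific encoder output
loss = minibatch_loss(model, points, occ_gt, c)
loss.backward()
```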
#### Network architecture

- The network architecture is generally the same across the different tasks (e.g. single-image 3D reconstruction or point-cloud completion); the task-specific encoder is the only changing element.

  <img src="https://i.imgur.com/HnVg3xN.png" alt="network architecture" style="width:550px; margin-top:0px;"/>

- Once the task encoder has produced the embedding **c**, it is passed to the decoder together with the batch of T sampled 3D points, which are processed by 5 sequential ResNet blocks. To condition the network output on the input embedding **c**, [Conditional Batch Normalization](https://paperswithcode.com/method/conditional-batch-normalization) is used (a rough sketch of such a layer is given at the end of this note):

  <img src="https://i.imgur.com/zk1csJh.png" alt="Conditional Batch Normalization" style="width:200px; margin-top:0px;"/>

- For single-image 3D reconstruction the network uses a ResNet-18 encoder whose last layer is replaced to produce a 256-dimensional embedding.

  <img src="https://i.imgur.com/p6S1Mtm.png" alt="single image encoder" style="width:800px; margin-top: 2px; margin-bottom:10px;"/>

- For point-cloud completion a modified version of the PointNet encoder is used.

  <img src="https://i.imgur.com/Hyfh8Y2.png" alt="point cloud encoder" style="width:800px; margin-bottom:12px;"/>

- For voxel super-resolution the network uses a 3D CNN encoder, which encodes a 32<sup>3</sup> input grid into a 256-dimensional embedding vector.

  <img src="https://i.imgur.com/UZcDOzH.png" alt="voxel grid encoder" style="width:600px; margin-bottom:10px;"/>

### Experiments

- **Single-image 3D reconstruction:**

  <img src="https://i.imgur.com/7lEnENh.png" alt="single image 3D reconstruction" style="width:400px; margin-bottom:10px;"/>

- **Point-cloud completion:**

  <img src="https://i.imgur.com/Fi0khPD.png" alt="point cloud completion" style="width:600px; margin-bottom:10px;"/>

- **Voxel super-resolution:**

  <img src="https://i.imgur.com/RDEpTvE.png" alt="voxel super resolution" style="width:600px; margin-bottom:10px;"/>

- **Generalization to the Pix3D dataset** (unseen data); the results leave room for improvement:

  <img src="https://i.imgur.com/QfT9liU.jpg" alt="generalization to Pix3D" style="width:600px; margin-top:10px;"/>

### Metrics

- Comparison against [3D-R2N2](https://arxiv.org/abs/1604.00449), [PSGN](https://arxiv.org/abs/1612.00603), [Pix2Mesh](https://arxiv.org/abs/1804.01654) and [AtlasNet](https://arxiv.org/abs/1802.05384) on the [ShapeNet](https://shapenet.org/) dataset.

  <img src="https://i.imgur.com/yoFgYvI.png" alt="comparison on the ShapeNet dataset" style="width:800px; margin-top:10px;"/>

<br/>
<small>The diagrams shown are taken from the paper with minor changes.<br/>Any errors in the interpretation are my own.</small>
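As referenced in the network architecture section, here is a rough sketch of a Conditional Batch Normalization layer, in which the per-feature scale γ(**c**) and offset β(**c**) are predicted from the embedding **c** instead of being learned as unconditional parameters. The layer below is my own minimal illustration (names and sizes are assumptions), not the authors' exact implementation.

```python
# Rough sketch of Conditional Batch Normalization (my own illustration):
# the scale gamma(c) and offset beta(c) are predicted from the observation
# embedding c rather than learned as unconditional affine parameters.
import torch
import torch.nn as nn


class ConditionalBatchNorm1d(nn.Module):
    def __init__(self, c_dim, f_dim):
        super().__init__()
        # Normalization without its own affine parameters ...
        self.bn = nn.BatchNorm1d(f_dim, affine=False)
        # ... the affine transform is instead produced from the embedding c.
        self.gamma = nn.Linear(c_dim, f_dim)
        self.beta = nn.Linear(c_dim, f_dim)

    def forward(self, f, c):
        # f: (B, f_dim, T) per-point features, c: (B, c_dim) embedding
        gamma = self.gamma(c).unsqueeze(-1)  # (B, f_dim, 1)
        beta = self.beta(c).unsqueeze(-1)    # (B, f_dim, 1)
        return gamma * self.bn(f) + beta


# Usage with stand-in tensors: 8 objects, 256 features per point, 2048 points.
layer = ConditionalBatchNorm1d(c_dim=256, f_dim=256)
out = layer(torch.randn(8, 256, 2048), torch.randn(8, 256))
```

Setting `affine=False` keeps the normalization itself parameter-free, so the only scale and shift applied to the point features is the one conditioned on **c**.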