# Scaling Down NeRF with NeRF2D

In this blog post, we present **NeRF2D**, a 2D analogue of the NeRF task that facilitates experimentation with Neural Radiance Fields.

![Page 1](https://hackmd.io/_uploads/Byy4Bc2S0.png)

_We show that we can reformulate NeRF in 2D by reconstructing a 2D shape from 1D views of it. Fitting a 2D NeRF is very fast, and we propose this as a viable toy dataset for quick experimentation on NeRF._

# Introduction

Neural Radiance Fields (NeRF) [^nerf] have recently shown impressive results for the task of novel view synthesis, and since their introduction, there has been an explosion of follow-ups (nearly 6k citations!). Many of these follow-ups build on the base method by introducing targeted changes, such as alternative loss functions, scene representations, or other relatively small modifications.

Most of these works validate their methods through empirical experimentation. However, this can be incredibly computationally demanding: **training a single NeRF can take up to 48h of GPU time** [^nerf], which quickly becomes prohibitively expensive, especially for small-scale experimentation.

Taking inspiration from MNIST1D [^MNIST1D], a small, 1D version of MNIST that enables very quick experimentation, we propose **NeRF2D**, a conceptually and computationally simpler version of NeRF designed for rapid experimentation. Our key observation is that NeRF can be reformulated in 2D, where we use 1D images to learn a neural representation of a 2D scene. This dimension reduction drastically decreases the computational cost, **allowing us to train 2D NeRFs in under a minute on a consumer laptop GPU!**

We validate our approach by reproducing two follow-ups to NeRF in NeRF2D, namely DS-NeRF [^depth] and pixelNeRF [^pixel]. We obtain results very similar to those reported in the original papers, but using considerably less computation. This indicates that our toy task can be used for quick, small-scale experimentation with NeRFs.

## Background: NeRF

Before introducing NeRF2D, we recap NeRF in more detail, as it is useful to have the notation in mind before reducing it to 2D. NeRF aims to learn how to render novel views of a scene using only a sparse set of views. This is achieved by learning an underlying 3D representation of the scene, which is optimized by rendering it from the known poses and using the loss between the render and the ground truth as supervision.

![image](https://hackmd.io/_uploads/Syx0bXsHC.png)

NeRF represents a 3D scene as a multi-layer perceptron (MLP) $F_\Theta$ that maps a position $\mathbf{x}=(x,y,z)$ and viewing direction $\mathbf{d}=(\theta, \phi)$ to a color $\mathbf{c}=(r,g,b)$ and volume density $\sigma$, where the weights $\Theta$ are the parameterization of the scene:

$$
F_\Theta: (x,y,z,\theta,\phi) \to (r,g,b,\sigma)
$$

Given this representation, we can render the scene from a given viewpoint by casting a ray for each pixel and querying the neural representation at multiple samples along the ray, which are then aggregated into a color using traditional volume rendering. By expressing this rendering procedure in a differentiable way, we can use the MSE between the rendered view and the ground-truth view as a loss to optimize the weights $\Theta$ of the representation.

![image](https://hackmd.io/_uploads/SyvAvMirR.png)
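To make the aggregation step concrete, here is a minimal PyTorch sketch of the quadrature rule NeRF uses to composite samples along a ray. The tensor layout and the helper name `composite` are our own choices for illustration, not prescribed by the paper:

```python
import torch

def composite(rgb, sigma, deltas):
    """Differentiable volume rendering along rays (a simplified sketch).

    rgb:    (..., S, 3) colors predicted at the S samples along each ray
    sigma:  (..., S)    volume densities predicted at the samples
    deltas: (..., S)    distances between consecutive samples
    """
    # Opacity of each segment: alpha_i = 1 - exp(-sigma_i * delta_i)
    alpha = 1.0 - torch.exp(-sigma * deltas)
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
    # Ray-termination distribution and the expected (rendered) color.
    weights = trans * alpha
    return (weights[..., None] * rgb).sum(dim=-2)
```

Because every operation here is differentiable, the MSE between the composited color and the ground-truth pixel can be back-propagated directly into $\Theta$.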
## NeRF2D: Throwing away the $z$ dimension

Now that we have refreshed NeRF, NeRF2D follows naturally by simply dropping the $z$ dimension.

Instead of having a set of 2D training views as our dataset, each with a 3D position and 3D rotation, we have a set of 1D training views, each with a 2D position and an angle:

![Page 1](https://hackmd.io/_uploads/Byy4Bc2S0.png)

And for our model, instead of taking a 3D coordinate $x,y,z$ and view direction $\phi,\theta$, we take a 2D coordinate $x,y$ and a single angle $\theta$ as input to our neural volume:

![Page 3](https://hackmd.io/_uploads/r1y0I9nSR.png)

Besides this, the training and rendering procedure remains completely unchanged.

### Generating NeRF2D datasets

Now that we have shown how NeRF can be formulated in 2D, we need an actual multi-view dataset to train on. To enable quick experimentation, we created a Blender script for creating NeRF2D datasets. Citing from Scaling Down Deep Learning [^MNIST1D], the paper that proposed the MNIST1D dataset:

> The ideal toy dataset should be procedurally generated to allow researchers to vary its parameters at will

The script makes it easy to place cameras in a circle around an object, vary the radius and camera parameters, and add random noise to the distances, angles, and positions:

![Peek 2024-06-16 19-01](https://hackmd.io/_uploads/Bk0oq9hBA.gif)

To ensure we are rendering in 2D, we place all cameras and lights at $z$-coordinate 0 and render at a resolution of $H \times 1$. We can then create our dataset by rendering the scene from all cameras and saving their transforms. Throughout our experiments, we used four scenes, modeled as closed 2D line segments:

![image](https://hackmd.io/_uploads/BybqGA3H0.png)

For all scenes, we use the same training, testing, and validation views; however, our script makes it easy to procedurally generate new datasets by simply modeling the 2D shape of interest and rendering a set of views. We used randomly sampled and slightly noisy views for training and validation, and evenly spaced cameras for testing.

![image](https://hackmd.io/_uploads/Hy5vv03BC.png)

Since each training view is just a single line of pixels, we found it helpful to visualize all of the views together by concatenating them horizontally and plotting them as one image. For the "Square" scene, we get the following views (each column is a view):

![image](https://hackmd.io/_uploads/H1PEF02H0.png)

For completeness, we show the same views for our other datasets. Interestingly, "Square Convex" looks very similar, with only minor appearance variation at the center:

![image](https://hackmd.io/_uploads/H1vaKChr0.png)

And the "Bunny" and "TU" scenes:

![image](https://hackmd.io/_uploads/r1GtKA2HC.png)

![image](https://hackmd.io/_uploads/SJWiYR3HC.png)

### Training a 2D NeRF

Now that we have datasets, we can start training 2D NeRFs! To do this, we re-implemented the base NeRF architecture in PyTorch Lightning, focusing on readability and extensibility to make experimentation easy. With our implementation, we can fit a 2D NeRF using 50 views of resolution 100 in under a minute.
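To give a feel for how little code this takes, below is a minimal, self-contained sketch of a 2D radiance field and its training step in PyTorch Lightning. The module names and batch layout are illustrative simplifications rather than our exact implementation, and it reuses the `composite` helper sketched in the background section:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

class RadianceField2D(nn.Module):
    """Minimal 2D field: (x, y, theta) -> (r, g, b, sigma)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, xyt):
        out = self.mlp(xyt)
        rgb = torch.sigmoid(out[..., :3])   # colors in [0, 1]
        sigma = F.relu(out[..., 3])         # non-negative density
        return rgb, sigma

class NeRF2D(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.field = RadianceField2D()

    def training_step(self, batch, batch_idx):
        # Illustrative batch layout: sample coordinates along each 2D ray,
        # the spacing between samples, and the ground-truth pixel colors.
        xyt, deltas, target_rgb = batch           # (R, S, 3), (R, S), (R, 3)
        rgb, sigma = self.field(xyt)
        rendered = composite(rgb, sigma, deltas)  # volume rendering, as sketched above
        loss = F.mse_loss(rendered, target_rgb)
        self.log("train/mse", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=5e-4)
```

A full training run then roughly reduces to `pl.Trainer(max_epochs=...).fit(NeRF2D(), dataloader)` over the 1D training views.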
We show the reconstructed testing views after training on each scene:

![image](https://hackmd.io/_uploads/SJwzC0nH0.png)

Additionally, since we are working in 2D space, we can visualize the density field by uniformly sampling $x,y$ coordinates and querying the density over space, enabling us to visualize the reconstructed geometries:

![image](https://hackmd.io/_uploads/Sy9u0AnHC.png)

## Experiments

Now that we have presented our base NeRF2D framework and datasets, we show how we can use it to perform small-scale experimentation.

### Positional Encoding

A critical component of NeRF's success was the use of **positional encoding**. The spectral bias of neural networks makes it difficult for them to express high-frequency spatial functions. The NeRF authors found that a simple solution is to pass the coordinates through a positional encoding $\gamma$ before feeding them to the MLP:

![image](https://hackmd.io/_uploads/HywmuNiS0.png)

where $\gamma(p)$ is a series of $L$ alternating sines and cosines of $p$ with exponentially increasing frequencies, and $L$ is a hyperparameter. In NeRF, this proved critical: without it, the model only fits a blurry version of the scene:

![image](https://hackmd.io/_uploads/BkdSPNiHA.png)

We validated this in NeRF2D on the "Bunny" scene and, unsurprisingly, found that without positional encoding ($L=1$) the learned density field is very low-frequency, while increasing $L$ lets us fit higher-frequency signals and also leads to increased PSNR:

![image](https://hackmd.io/_uploads/r1nCKEirA.png)

### Depth Supervision

![image](https://hackmd.io/_uploads/r1bSnEsrA.png)

It has been observed that training NeRFs can be challenging given only a few training views: the model fails to learn the underlying geometry, and the quality of the rendered results on unseen views suffers. As a solution, Depth-supervised NeRF (DS-NeRF) [^depth] utilizes additional supervision from provided depth information. This requires that the input dataset includes, along with the RGB values of each pixel, the corresponding depth values; such a depth map can be approximated in the structure-from-motion process typically used to estimate the camera poses.

The proposed depth loss is the KL divergence between the predicted ray-termination distribution $h(t)$ and a 1D Gaussian centered at the ground-truth depth $D$ with variance $\hat\sigma^2$:

![3E9F5232-E137-4C87-93FF-2BEEB0B8A6DC](https://hackmd.io/_uploads/rJMVGBiBA.png)

Incorporating this term into the loss ensures that the distribution of termination points along the ray aligns with the known depth information, thereby enforcing accurate geometry reconstruction.
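Below is one way such a term can be written in our 2D setting, assuming the compositing step is extended to also return its ray-termination `weights`. The variable names and the discretization of the Gaussian are our own simplifications rather than the exact formulation from the paper:

```python
import torch

def depth_loss(weights, t_vals, deltas, depth_gt, sigma_hat=0.05, eps=1e-8):
    """Sketch of a DS-NeRF-style depth term for a batch of rays.

    weights:  (R, S) ray-termination distribution h(t) from volume rendering
    t_vals:   (R, S) depths of the samples along each ray
    deltas:   (R, S) spacing between consecutive samples
    depth_gt: (R,)   known depth for each ray
    sigma_hat: assumed noise scale of the depth supervision (a hyperparameter)
    """
    # Narrow Gaussian around the ground-truth depth, evaluated at the samples.
    gauss = torch.exp(-0.5 * ((t_vals - depth_gt[..., None]) / sigma_hat) ** 2)
    # Penalize termination mass that falls far from the known depth
    # (a discretized KL / cross-entropy between the Gaussian and h).
    per_ray = -(gauss * torch.log(weights + eps) * deltas).sum(dim=-1)
    return per_ray.mean()
```

This term is then added to the photometric MSE loss with a weighting factor.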
The visual results from the DS-NeRF paper show significant improvements in rendering quality when depth supervision is used, especially when training on sparse input views:

![image](https://hackmd.io/_uploads/BJQt7ChHA.png)

We reproduce the paper in NeRF2D by extending our dataset generation to also render ground-truth depth maps and incorporating the depth loss. We show our results for the "Square" scene: the test renders, test depths, and the visualized density field for NeRF and DS-NeRF at different levels of training-view sparsity, namely 2, 5, and 10 training views. As expected, depth supervision greatly improves the visual quality and the reconstructed geometry, especially when using sparse training views:

![cube-min](https://hackmd.io/_uploads/BkAkPinSC.png)

The results are even more evident on the more complex "Square Convex" scene: here, NeRF is unable to reconstruct the convexity, whereas DS-NeRF accurately reconstructs the geometry even with sparse training views:

![cube_convex-min](https://hackmd.io/_uploads/ByNJPjnBA.png)

Finally, we show our results for the "Bunny" and "TU" scenes, where the effect of depth supervision is even more evident. Notably, DS-NeRF can accurately reconstruct even the complex geometry of the "TU" scene:

![bunny-min](https://hackmd.io/_uploads/B1M0Uo3SR.png)

![tu-min](https://hackmd.io/_uploads/HkiQDonHC.png)

### Pixel Space Features

Similarly to DS-NeRF, pixelNeRF [^pixel] was introduced to improve NeRF's ability to deal with a small number of images. However, unlike DS-NeRF, pixelNeRF does not use any additional supervision. PixelNeRF utilizes a pretrained ResNet to extract features from the input images. When querying the network, the queried position is projected into each input view to find the corresponding image features. These features, along with the position and direction, are fed into the NeRF network.

Implementing pixelNeRF2D presents a challenge due to the lack of pretrained ResNets for 1D images. We addressed this by using a CNN with randomly initialized weights (a sketch of such a 1D encoder is shown at the end of this section). Otherwise, reproducing pixelNeRF with NeRF2D was straightforward.

PixelNeRF demonstrates superior learning capability with few views compared to NeRF, as illustrated in the figure below:

![image](https://hackmd.io/_uploads/HJSerI3rR.png)

Furthermore, when tested on the DTU dataset, pixelNeRF showed a clear advantage over NeRF when using only 1 or 3 views. However, the performance was comparable when using 6 views, and NeRF even outperformed pixelNeRF with 9 views. This can be seen in the table below:

![image](https://hackmd.io/_uploads/ByE0oI3rR.png)

To demonstrate NeRF2D's suitability as a toy task, we compared pixelNeRF2D's and NeRF2D's ability to learn a scene from a few, or even a single, input view. The figures below compare pixelNeRF2D to NeRF2D on 1, 3, 5, and 9 views of the "Bunny" and "TU" scenes:

![image](https://hackmd.io/_uploads/H19Ep8hrA.png)

![image](https://hackmd.io/_uploads/HkndpU3H0.png)

We observe trends similar to the original pixelNeRF paper: pixelNeRF2D outperforms NeRF2D with fewer views but performs worse with larger numbers of views. The superior performance of pixelNeRF2D is evident both visually and in the higher PSNR scores. Being able to reproduce the results of pixelNeRF at a fraction of the computational cost shows the value of NeRF2D as a toy task.
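For reference, the 1D encoder mentioned above can be as simple as a stack of randomly initialized `Conv1d` layers. The sketch below (with illustrative layer sizes, not our exact architecture) extracts a feature vector at a continuous pixel coordinate of a 1D view by interpolating the feature map:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder1D(nn.Module):
    """Random-weight 1D CNN standing in for pixelNeRF's ResNet encoder."""
    def __init__(self, in_channels=3, features=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, features, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(features, features, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(features, features, kernel_size=3, padding=1),
        )

    def forward(self, views):      # views: (N, 3, W) 1D input images
        return self.net(views)     # (N, features, W) feature maps

def sample_features(feat_maps, coords):
    """Linearly interpolate features at continuous pixel coordinates.

    feat_maps: (N, C, W) encoder output for N input views
    coords:    (N, R)    projected pixel coordinates in [-1, 1] for R query points
    """
    # grid_sample expects 4D input, so treat each 1D feature map as a 1-pixel-tall image.
    grid = torch.stack([coords, torch.zeros_like(coords)], dim=-1).unsqueeze(1)  # (N, 1, R, 2)
    sampled = F.grid_sample(feat_maps.unsqueeze(2), grid, align_corners=True)    # (N, C, 1, R)
    return sampled.squeeze(2).permute(0, 2, 1)                                   # (N, R, C)
```

The sampled per-view features are then concatenated with the query position and angle before being fed to the field MLP, mirroring pixelNeRF's conditioning scheme.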
## Conclusion

In conclusion, we have presented NeRF2D, a simplified task designed to enable quick experimentation with Neural Radiance Fields. We provide a Blender script for quick dataset generation and a PyTorch Lightning implementation of NeRF2D that trains 2D NeRFs in a matter of seconds on a laptop GPU. We then used our implementation to reproduce results from two follow-up works, indicating that our toy scenario is representative of the more complex 3D setting. We believe NeRF2D's simplicity and computational efficiency make it a promising "playground" for experimenting with NeRF.

### References

[^pixel]: Yu, A., Ye, V., Tancik, M. & Kanazawa, A. (2021). pixelNeRF: Neural Radiance Fields from One or Few Images. arXiv:2012.02190v3. https://doi.org/10.48550/arXiv.2012.02190

[^depth]: Deng, K., Liu, A., Zhu, J. & Ramanan, D. (2021). Depth-supervised NeRF: Fewer Views and Faster Training for Free. arXiv:2107.02791v2. https://doi.org/10.48550/arXiv.2107.02791

[^nerf]: Mildenhall, B., Srinivasan, P., Tancik, M., Barron, J., Ramamoorthi, R. & Ng, R. (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. arXiv:2003.08934v2. https://doi.org/10.48550/arXiv.2003.08934

[^MNIST1D]: Greydanus, S. (2020). Scaling Down Deep Learning. arXiv:2011.14439. https://doi.org/10.48550/arXiv.2011.14439