# NeRF in the Dark: Reproducibility blog

Group 62
* Miruna Betianu - 5632927
* Sjoerd Groot - 4694368
* Menno van Laarhoven - 4676750
* Jasper-Jan Lut - 4698207

The goal of this blog is to present our attempt at reproducing the paper ***NeRF in the Dark**: High Dynamic Range View Synthesis from Noisy Raw Images* [1]. This reproduction is part of the course *Deep Learning* (CS4240) at the *Delft University of Technology*. By the end of this blog we hope to have reproduced images similar to the figure below.

![](https://i.imgur.com/cubyi0t.png)

## Introduction

Before diving into the reproduction of the paper, it is important to introduce the foundations it is built upon: first Neural Radiance Fields (NeRF), followed by mip-NeRF, which serves as the basis for NeRF in the Dark.

### NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

The technique that our paper builds upon is Neural Radiance Fields (NeRF). It "represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (the spatial location (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location" [3]. This allows novel viewing angles, not present in the training data, to be rendered. The figure below displays the pipeline for creating such a novel viewpoint.

![](https://i.imgur.com/nciaDvQ.png)

### Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields

The rendering procedure used when training a NeRF scene samples only one ray per pixel. The downside of using a single ray per pixel is that it can cause excessively blurred and aliased results when a scene is rendered at a different resolution than the original training data. Mip-NeRF extends the rendering procedure "by efficiently rendering anti-aliased conical frustums instead of rays" [2], as shown in the figure below. This significantly improves NeRF's ability to render fine details and removes many aliasing artefacts, reducing the error rate by 17%.

![](https://i.imgur.com/LOXNFG1.png)

### RawNeRF: High Dynamic Range View Synthesis from Noisy Raw Images

The downside of both NeRF and mip-NeRF, as with many other view synthesis methods, is that they use tone-mapped low dynamic range (LDR) images as input; these images have been processed such that the details are smoothed and the highlights are clipped. The paper we reproduce proposes a technique that trains directly on linear RAW images, thereby preserving the scene's full dynamic range. The result is a network that matches or exceeds the results of a standard NeRF and, because it is trained on noisy raw images, can even outperform dedicated denoising networks trained on similar input images.

![](https://i.imgur.com/KqePQ3f.png)

## Reproducibility approach

Our approach to reproducing the paper is summarized in our [Gantt chart](https://docs.google.com/spreadsheets/d/1Tt0o0FRFFn8QbtoqBvdLMmI7OEd4l37GoZZgVVUeAIo/edit?usp=sharing). In short, we separated the project into three phases:

1. *Project familiarization (completed)*. Dedicated to getting to know the material, the existing baseline code and the available data, followed by training the Mip-NeRF network on a .jpg and .png image dataset.
2. *Project reproduction (partially completed)*. During this phase the changes described by the authors were implemented on top of the base code, resulting in the ability to reproduce the authors' results. This includes:
    1. Raw image creation.
    2. Implementing the loss function.
    3. Implementing the new activation function.
    4. Implementing the post-processing pipeline.
3. *Optional (not completed)*. The final step is to perform hyperparameter tuning and evaluate the changes.

### (Un)Available data

A difficulty encountered during the reproduction attempt is the unavailability of both the model code and a RAW image dataset. Therefore, in this project we create a new implementation on top of the Mip-NeRF base code that incorporates the changes described in the RawNeRF paper. To train the network we generated a new RAW image dataset, both synthetic and real-world. This is discussed in the following sections.

## Running the baseline code

When attempting to run the base code the paper was built upon, there are two options: run it locally on a PC or laptop, or run it on a cloud computing platform such as Google Cloud. Both were attempted during the reproduction process, with the cloud version being our final choice.

![](https://i.imgur.com/8xKyBA7.jpg "https://www.intellithought.com/pros-cons-cloud-storage-vs-dedicated-servers-part-2/")

### Running Mip-NeRF locally

When installing Mip-NeRF locally, the steps in the README file provide some guidance, and all dependencies can be installed as intended. At this point Mip-NeRF will run on your local machine, but more likely than not it will not be able to use your GPU right out of the box. This was the case for our machines. When running on the CPU, the network was only able to train at ~300 rays/s. Throughout the project, several attempts were made to correctly install GPU support, including a fresh Linux install. In the end, the closest we got was by following this blog post: https://markjay4k.github.io/Install-Tensorflow/

At this point the GPU would run, but the program would still crash later in the training process when generating a checkpoint. From this shorter run it was observed that an Nvidia GTX 1050 Ti produces around ~1000 rays per second and a Quadro M1200 around ~800 rays per second, giving an estimated computing time of more than 5 days for a sufficient number of training steps. This was unmanageable, so the Google Cloud option was investigated.

### Google Cloud

When using Google Cloud, the initial step was to set up a JupyterLab notebook instance equipped with an Nvidia T4 GPU and four CPU cores, each with 16 GB of RAM, using CUDA version 11.0. To make this work, we needed to enable the Compute Engine and Notebook APIs for our Google Cloud project and then request an increase of the region's GPU quota from 0 to 1. After the request was approved, we could proceed with setting up the complete code base, at which point we encountered a mismatch between the Nvidia libraries and TensorFlow. This was resolved by downgrading the TensorFlow version, which restored stable functioning. The training process was significantly accelerated on this notebook instance, reaching approximately 3000 rays/s. This was faster, but not fast enough, as running 1 000 000 optimization steps would still take days. The second strategy was to employ TPUs rather than GPUs. According to the authors of the original NeRF paper, training time is reduced from days to 2 hours and 30 minutes using a v2-128 TPU.
After comparing the prices of the TPUs, we determined that we could not afford them for more than a few hours, so we applied for the TPU Research Cloud, which provides free access to certain TPU versions in certain regions. Our request was approved within two days, and we could begin creating a VM instance with a v3-8 TPU. The virtual machine instance is a headless Linux server running Ubuntu 20.04, which you connect to through SSH from a terminal on your local PC. Our environment had to be recreated, and Anaconda, CUDA and cuDNN installed. After the setup was complete, we discovered a more efficient method of transferring the dataset from the local PC to the cloud VM: using SCP. The performance achieved with the eight TPU v3 cores exceeded our expectations, significantly reducing training time from days to a few hours; the average number of rays processed per second was 15 000. On the original Mip-NeRF, training for 1 000 000 steps took approximately 6 hours.

Additionally, the server included four CPU cores as a fallback, which used up all of the Google Cloud account's credits for no apparent reason. This created a larger issue: without credits, even the free resources become inoperable, and our system failed and went into 'hidden' mode. After resolving the billing issue, the node began 'unhiding', a process that was still ongoing after one week. We attempted to contact the Google team for assistance, but they did not respond promptly and the solution they supplied was ineffective. Nonetheless, we decided to retry the whole process on a different server. This was not as seamless as anticipated: we were unable to create servers in any of the regions supported by the TPU Research Cloud. The only option that worked was downgrading to v2 TPUs, which are more stable but less performant than v3 TPUs. We repeated the previous VM's setup steps and obtained 10 000 rays/s, but after a while JAX failed to recognize the TPU cores and we were forced to continue our work on the GPU. As final remarks: we experimented with a variety of GPUs and CUDA versions, but the combination that worked best for us was CUDA 11.0 with a Tesla T4 GPU, which is a budget-friendly solution. We strongly recommend using multiple TPU cores rather than GPUs when training with JAX, as this results in a much shorter training time.

### Mip-NeRF results

When the baseline code for Mip-NeRF was finally up and running, the following results were produced: the Blender Lego dataset after 100 000 epochs, and the T-rex scene from the LLFF dataset after 20 000 epochs. In each figure, the left image is the rendered model and the right image is the baseline image. Observe that for decent results the model needs to be trained for a large number of epochs, although a resemblance to the original image can already be seen in the T-rex image.

![](https://i.imgur.com/FRB7hnt.jpg)
![](https://i.imgur.com/N1H6ZCZ.png)

## Creating our own dataset

As the original RawNeRF paper did not provide its dataset, a new dataset had to be created to test our RawNeRF recreation on. This dataset consisted of two parts: synthetic data and real-world captured images.

### Synthetic data

For the synthetic dataset, the Lego Blender scene in the blend_files.zip from the original NeRF paper was used: https://drive.google.com/drive/folders/128yBriW1IG_3NJ5Rp7APSTZsJqdJdfc1

![](https://i.imgur.com/HrFsL04.png)

The Blender scene provided by NeRF already has a script to render images in an automated fashion. By changing the output format to `'OPEN_EXR'`, reducing the resolution and setting the correct results path, we could run the script and render the scene. Do note that as of version 2.92 of Blender, the UI stops responding until a script has finished running; since rendering all images takes quite a while, it can look like Blender has crashed, but this is fine.
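For illustration, the snippet below sketches the kind of render settings we changed before running the automated render. This is a minimal sketch, not the actual script embedded in the .blend file (which also animates the camera path), and the output path is a placeholder.

```python=
# Minimal sketch of the Blender render settings we adjusted; the real
# embedded script differs and the output path below is a placeholder.
import bpy

scene = bpy.context.scene

# Render to linear OpenEXR instead of PNG so the full dynamic range is kept.
scene.render.image_settings.file_format = 'OPEN_EXR'
scene.render.image_settings.color_depth = '32'

# Reduce the resolution to keep the total render time manageable.
scene.render.resolution_x = 400
scene.render.resolution_y = 400

# Render a single frame to the results folder.
scene.render.filepath = '/path/to/results/r_000'
bpy.ops.render.render(write_still=True)
```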
Once exported, the images were split into even and odd sets for training and test data, so that the positions were interleaved. In hindsight, it might have been more convenient to store the images in a *\*.tiff* format, as *rawpy* could then have been used to load both the camera *dng*s and these Blender images.

### Real world images

Two real-world datasets were captured to test on: one of a castle room and one of a plant with Christmas lights. The castle room dataset consists of 31 images, all shot with the same ISO, shutter speed and white balance. For the Christmas plant, the ISO, white balance and focus were fixed but the shutter speed was varied: 10 images of the plant were taken from the same position, and from 8 other positions pictures were taken at 3 different exposure times, resulting in a total of 35 pictures. For all pictures both the raw image and the *jpg* image were captured.

| Plant with christmas lights | Castle room |
| -------- | -------- |
| ![](https://i.imgur.com/x7C5YH7.jpg) | ![](https://i.imgur.com/5jZjSbM.jpg) |

#### Reconstructing camera poses

As NeRF needs the camera poses to train on, these had to be reconstructed from the images. To do this, the *imgs2poses* script from LLFF [4] was used. Make sure the prerequisite dependencies are installed before running the script:

`pip install -r requirements.txt`
`sudo apt-get install colmap`

After all dependencies are installed, the poses can be reconstructed by running:

`python imgs2poses.py path_to_folder`

Also make sure that the images are inside a folder called 'images' within the folder you are pointing to. As this process does not work on raw images, we used the *jpg* images for the pose reconstruction. One difficulty is that the main branch of LLFF does not work when *colmap* cannot find the position of every single image; since we are dealing with very dark images, the pose estimation is likely to fail for some of them. Luckily, there is [a pull request by starhiking](https://github.com/Fyusion/LLFF/pull/60) that alleviates this problem. With this version of *pose_utils*, the poses that could not be reconstructed are ignored and we can proceed as usual with the images that were properly identified. For the Christmas plant this meant we had to drop the darkest images; for the castle dataset, all images were identified.

#### Resizing images

As training on 4000x3000 pixel images would take too long and use too much memory, the images were scaled down by 4x and 8x using the Adobe DNG Converter. For the raw images this did mean that the Bayer mosaic was lost.

## Implementation

### Photo loader (Raw and Blender)

The Mip-NeRF base code provides loaders for training on several types of datasets. The first is the *nerf_synthetic* dataset created in Blender, of which the Lego model is a famous example. Secondly, there are the *nerf_llff* datasets containing real-life images. The datasets are loaded and processed using dedicated dataset loader classes accepting both the *.jpg* and *.png* file formats. We expand on the existing structure with two new classes for loading both synthetic RAW data created with Blender and LLFF RAW data. The network now accepts both the *.exr* (left) and *.dng* (right) file formats.

![](https://i.imgur.com/FOQyL0b.png)
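To give an idea of what the raw LLFF loader does, below is a minimal sketch of reading a camera *.dng* into linear colour values in $[0,1]$ using *rawpy*. The helper `load_linear_raw` is hypothetical; our actual loader class differs in its details, and the Blender *.exr* files are handled analogously by the second class.

```python=
# Hypothetical helper illustrating how a camera *.dng is turned into linear
# colour data in [0, 1]; the real dataset loader class differs in details.
import numpy as np
import rawpy

def load_linear_raw(path):
    with rawpy.imread(path) as raw:
        # Demosaic without tone mapping: gamma (1, 1) and no auto-brightening
        # keep the values linear, which is what the RawNeRF-style loss expects.
        rgb = raw.postprocess(gamma=(1, 1),
                              no_auto_bright=True,
                              use_camera_wb=True,
                              output_bps=16)
        # Normalize the 16-bit output to [0, 1].
        return rgb.astype(np.float32) / np.float32(2**16 - 1)
```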
### Activation function

The network should be able to learn values in a range larger than the $[0,1]$ range used in the original Mip-NeRF implementation. If values saturate they should be truncated to a maximum brightness of $1$ when rendering, but the learned representation should retain information beyond $1$ to cover a larger dynamic range. Hence we replace the original sigmoid output activation with an Exponential Linear Unit (ELU). This allows the network to represent a larger dynamic range, while maintaining a small non-zero gradient at small values to prevent the dying-ReLU problem.

### Loss function

The loss function in standard NeRF applications is usually given by
\\[ \hat L(y, \hat y) = \sum_i \left( \hat y_i - y_i \right)^2. \\]
At each pixel $i$ the predicted RGB value $\hat y_i$ is compared to the target value $y_i$. For images with a large dynamic range this means that the darker parts of an image contribute a relatively smaller error than the error between two bright pixels. In the paper a tone map $\psi(y) = \log(y + \epsilon)$ is applied to both the predicted value $\hat y$ and the target value $y$ to effectively normalize this difference. Linearizing $\psi$ around the prediction $\hat y_i$, so that the derivative of the tone map acts only as a per-pixel weight, yields the following loss function
\\[ \hat L(y, \hat y) = \sum_i \left( \frac{\hat y_i - y_i}{\text{sg}(\hat y_i) + \epsilon} \right)^2. \\]
Here $\text{sg}(\cdot)$ denotes a *stop gradient*, used so that the tone map acts only as a relative weight rather than contributing to the gradient itself; this is needed to converge to an unbiased result. The additional hyperparameter $\epsilon$ prevents division by zero, bounding the loss function. The resulting code becomes

```python=
# Relative MSE: the squared error is weighted by the (stop-gradient) predicted
# value, so dark and bright regions contribute comparably to the loss.
loss = (mask * ((rgb - batch['pixels'][..., :3]) /
                jax.lax.stop_gradient(rgb + config.supervision_constant))**2
        ).sum() / mask.sum()
```

### Exposure Time Correction

The strength of training on a RAW representation of the target image stems from the fact that a higher dynamic range of the target can be captured. Our dataset should reflect this by capturing the target over a range of exposure times. To learn the dynamic range of the scene we apply an exposure time correction to the output of the learned model
\\[ \hat y_c = \min \left( \hat y \, t, 1 \right). \\]
In this expression $t$ represents the exposure time of the target image. It was found that the initial configuration of the network produced predicted colour values larger than 1. As a result, during training the backpropagation for these predictions would be suppressed, preventing the model from learning. We therefore decided to remove the $\min()$ from the correction during training. At worst, this results in the overexposed (white) areas becoming slightly darker, because they are trained towards an average representation of the exposures.
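To make the exposure handling concrete, below is a minimal sketch in JAX of how the correction interacts with the tone-mapped loss. Names such as `exposure_t`, `exposure_correct` and `raw_loss` are ours, not the paper's, and the actual training code differs (for example, it uses the masked sum shown above rather than a plain mean).

```python=
import jax
import jax.numpy as jnp

def exposure_correct(rgb_pred, exposure_t, training=True):
    """Scale the predicted linear colour by the exposure time of the target.

    During training the values are left unclipped so that over-exposed
    predictions still receive a gradient; at render time we clip to 1.
    """
    scaled = rgb_pred * exposure_t
    return scaled if training else jnp.minimum(scaled, 1.0)

def raw_loss(rgb_pred, rgb_target, eps=1e-3):
    # Relative squared error with a stop-gradient weight, mirroring the
    # tone-mapped loss used in the main training code.
    weight = jax.lax.stop_gradient(rgb_pred) + eps
    return jnp.mean(((rgb_pred - rgb_target) / weight) ** 2)
```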
### Post-processing

As described in appendix D.2 of the paper, a post-processing procedure is used to translate the raw images into post-processed sRGB space. For all of the steps we refer the reader to the RawNeRF paper. We implemented the full post-processing pipeline described by appendix D in *post_processing_pipeline.py*. As an example, the code snippet of step 9, the sRGB gamma curve

$$z\leftarrow \gamma_{sRGB}(z),$$

is given below.

```python=
import numpy as np

def __apply_rgb_gamma_curve(z):
    """Step 9: apply the sRGB gamma curve to a linear RGB pixel z."""
    temp_z = np.zeros(3)
    for i in range(0, 3):
        if z[i] <= 0.0031308:
            # Linear segment of the sRGB transfer function.
            temp_z[i] = 12.92 * z[i]
        else:
            # Power-law segment.
            temp_z[i] = 1.055 * np.power(z[i], 1 / 2.4) - 0.055
    return temp_z
```

## Chaining it all together

In this section we present the results obtained after implementing all the modifications proposed by the RawNeRF paper.

### Hyperparameters

While training the network the following hyperparameters were used. These are heavily inspired by the choices already made in the Mip-NeRF paper, since a hyperparameter search is beyond the limits of our compute budget: a single training cycle takes about 10 hours. The parameter we did optimize is the *batch_size*, which was scaled to maximize GPU memory usage on the Google Cloud instance.

| Hyperparameter | Value | Info |
|--------------------|----------------|-----------------------------------------|
| Batching | "single_image" | Batch composition. |
| Batch_size | 2048 | The number of rays/pixels in each batch. |
| Learning rate | 1e-3 to 1e-5 | The initial and final learning rate. |
| Optimization steps | 100 000 | The number of optimization steps. |
| llffhold | 32 | Use every Nth picture for the test set. |
| Supervision constant $\epsilon$ | 1e-3 | Constant preventing division by 0 in the loss function. |

### Result

The original goal of this reproduction was to recreate the results shown in the introduction, where a set of noisy RAW input images yields a network that can accurately remove the noise and generate a novel viewpoint. However, as can be seen in the table below, our network has not yet converged to a noise-free image; several more days of training would be required. The table shows the noisy ground-truth data in the top left and the output of the learned model every 10 000 epochs. We find that the model initially mostly converges to an average colour of the picture.

![](https://i.imgur.com/7p7w2df.jpg)

|Ground truth |10K |20K |
|-|-|-|
|![](https://i.imgur.com/2uHh0ns.png)| ![](https://i.imgur.com/PNK9SUq.png)| ![](https://i.imgur.com/ug1ESfK.png)|
|30K|40K |50K |
| ![](https://i.imgur.com/qLLOuTI.png)|![](https://i.imgur.com/RGoX8Uo.png) |![](https://i.imgur.com/LHW7iox.png)|
|60K |70K |80K |
| ![](https://i.imgur.com/DqssmWr.png)| ![](https://i.imgur.com/u7F4ihN.png)| ![](https://i.imgur.com/NCCeh2w.png)|
| 90K | 100K| |
| ![](https://i.imgur.com/awZzCRH.png)|![](https://i.imgur.com/UsQ7hui.png)| |

### PSNR

When evaluating the resulting images, the PSNR equals 16.6. Comparing this to the table provided in the RawNeRF paper, our value is significantly lower than the result the authors obtained, although it is above the authors' noisy baseline. It is unclear whether this is a one-to-one comparison, because the images we generated do not look noise-free visually. However, as we do not have access to the same images the original authors used, this is the best comparison we can provide.

![](https://i.imgur.com/naZKssD.png)
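For reference, the PSNR was computed along the following lines. This is a minimal sketch, assuming `rendered` and `ground_truth` are images scaled to $[0, 1]$; it is not the exact evaluation script.

```python=
import numpy as np

def psnr(rendered, ground_truth, max_value=1.0):
    """Peak signal-to-noise ratio between two images in [0, max_value]."""
    mse = np.mean((rendered.astype(np.float64) -
                   ground_truth.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)
```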
## Discussion

### Implementation accuracy

We did not implement every step discussed in the original RawNeRF paper, due to incomplete information and the limited scope of the project.

- Camera shutter speed miscalibration. As a result, we expect the brightness of the network output to vary per colour channel, resulting in a perceptible colour shift in the image. This might be part of the reason for the perceived green colour shift in our results.
- Weight variance regularizer. According to the paper, leaving this out might introduce transparent "floater" artifacts in the output image. We have not yet observed these in our output, but they might emerge if the network is trained further.
- Activation function of the output layer. The paper states that an *exponential* activation function was used. It is unclear what exactly is meant by this, since e.g. a standard exponential $\exp(x)$ would compromise important properties of the network: it would, for example, be very difficult to obtain an output value of 0 ($x \rightarrow -\infty$), and it might introduce issues with very large gradients during backpropagation.
- In the exposure time correction we had to remove the $\min$ function described in the paper to allow the model to learn. The paper does not mention anything about this, suggesting a misinterpretation in our implementation.
- Gradient clipping. The original paper mentions that gradient clipping had to be implemented in order to stabilize the training of the network (since it would otherwise often return $\texttt{NaN}$ values).

### Dataset limitations

We quote the following from the original paper:

> RawNeRF itself is prone to reconstruction artifacts in very noisy scenes or scenes captured with few images (under 30), typically in the form of positional encoding grid-like artifacts

Our dataset consists of 9 different perspectives. To investigate this issue we present our output depth map, which should relate to the positional encoding in the network.

![](https://i.imgur.com/PPeHnLc.png)

A grid-like disturbance can indeed be observed in this image. For further work it is therefore recommended to take a minimum number of perspectives into account when recreating the dataset.

### Post-processing pipeline

When feeding the resulting raw image generated by our model through our post-processing pipeline, the following sRGB image was generated. Clearly this is not the desired result, so this step is currently omitted.

![](https://i.imgur.com/ZkaevrS.png)

## Division of work

| Whom | What |
|------------|--------|
| Miruna Betianu | Environment setup: Google Cloud TPU & GPU |
| Sjoerd Groot | Dataset creation |
| Menno van Laarhoven | Loss function, shutter speed correction, activation layer, raw image loading |
| Jasper-Jan Lut | Post-processing, image data extraction, raw image loading |

# References

[1]: B. Mildenhall, P. Hedman et al. *NeRF in the Dark: High Dynamic Range View Synthesis from Noisy Raw Images.* CVPR (2022)
[2]: J. T. Barron, B. Mildenhall et al. *Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields.* ICCV (2021)
[3]: B. Mildenhall, P. P. Srinivasan et al. *NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.* ECCV (2020)
[4]: B. Mildenhall, P. P. Srinivasan et al. *Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines.* ACM Transactions on Graphics (TOG) 38.4 (2019)