## RawNeRF - Blog Post Group 80
**Group members**
* Rick Huizer (4847652, r.m.huizer@student.tudelft.nl)
* Eduard Klein Onstenk (4930894, e.t.kleinonstenk@student.tudelft.nl)
* Devin Lieuw A Soe (4933028, d.l.lieuwasoe@student.tudelft.nl)
* Thomas Markhorst (4934210, t.c.markhorst@student.tudelft.nl)
## Introduction
Neural Radiance Fields (NeRF) is a technique for synthesizing novel views of a scene, given a set of images of that scene and their camera poses as input. Classical NeRF, however, uses low dynamic range (LDR) images as input: images that have gone through a postprocessing pipeline that smooths detail and clips highlights. As a result, the full dynamic range of the scene is not preserved and the noise distribution is distorted. By instead training NeRF on noisy raw input images, a scene can be reconstructed in linear high dynamic range (HDR) color space. This method, RawNeRF, is introduced in the work of Mildenhall et al. [1]. RawNeRF deals effectively with the noise in the input images because it is robust to the relatively simple noise distribution of raw data, a distribution that postprocessing would otherwise distort. Because of this, RawNeRF can render HDR novel views of scenes from noisy input images captured in limited lighting. Figure 1 compares the pipelines of NeRF and RawNeRF.
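RawNeRF's change to the training setup is small: the RGB output activation is changed (to an exponential, so radiance stays positive and unbounded in linear space) and the plain MSE loss is replaced by the relative loss from [1], which rebalances the contribution of dark and bright regions of the HDR scene. With $\operatorname{sg}(\cdot)$ denoting stop-gradient and $\epsilon$ a small constant, the loss is

$$
\mathcal{L} = \sum_i \left( \frac{\hat{y}_i - y_i}{\operatorname{sg}(\hat{y}_i) + \epsilon} \right)^{2}.
$$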

_Figure 1: The pipelines of both classical NeRF (top) and RawNeRF (bottom)_
## Research questions
The paper by Mildenhall et al. [1] describes an ablation study in which several shutter speeds are simulated to generate input images with different amounts of noise. Essentially, the shorter the shutter is open, the noisier the captured image, because the combined distribution of shot and read noise depends on the exposure time. When this duration is infinite, the resulting image contains no noise.
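This behaviour follows from the standard affine raw-noise model used in [1] and [3]: a pixel whose clean signal per unit of exposure is $x$, captured with shutter time $t$, is approximately distributed as

$$
z \sim \mathcal{N}\!\left(t\,x,\; \lambda_{\text{shot}}\, t\, x + \lambda_{\text{read}}\right),
$$

so the exposure-normalized noise $\operatorname{std}(z/t)$ shrinks on the order of $1/\sqrt{t}$ and vanishes as $t \to \infty$.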
The results of this experiment show that RawNeRF, run on the ‘synthetic _Lego_ scene data’, renders images with a peak signal-to-noise ratio (PSNR) of 36.85 when an infinite shutter time is simulated for the input (their Table 3). In this reproducibility project, we attempt to reproduce this number. The first research question is therefore:
_'To what extent can the result of the experiment with RawNeRF trained on noiseless input images be reproduced?'_
Since RawNeRF is shown to also work well with noisy input images, we were interested in the influence of the number of images that RawNeRF is trained on. Hence, the second research question of this project is:
_'How well does the denoising functionality of RawNeRF perform when training with different training set sizes?'_
Furthermore, the RawNeRF paper states that the experiments are done with a batch size of 16k. Because we are curious about the influence of a different batch size on the resulting renders, as proposed in the documentation of the mip-NeRF implementation [2], we pose the question:
_'How do different batch sizes used for training influence the resulting HDR renders of RawNeRF?'_
## Related work
**Mip-NeRF**
Mip-NeRF [2] is an extension of the classical NeRF method. Instead of rays, it casts anti-aliased conical frustums, thereby addressing the objectionable aliasing artifacts of NeRF. The rays cast by classical NeRF encode only the positions of points along them, whereas the cones cast by mip-NeRF encode both the positions and the sizes of conical frustums. As a result, mip-NeRF achieves a significant accuracy gain as well as an increase in speed compared to NeRF.
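The key mechanism here is mip-NeRF's integrated positional encoding (IPE): each conical frustum is approximated by a Gaussian, and the positional encoding is computed in expectation over that Gaussian, so high frequencies are automatically damped for large frustums. Below is a minimal NumPy sketch of this idea (our own function with a diagonal-covariance simplification, not the mip-NeRF codebase's implementation):

```python
import numpy as np

def integrated_pos_enc(mu, var, num_freqs):
    """Sketch of mip-NeRF's integrated positional encoding (IPE).

    mu, var: mean and diagonal variance of the Gaussian approximating a
    conical frustum. For x ~ N(mu, var):
        E[sin(2^j x)] = sin(2^j mu) * exp(-0.5 * 4^j * var)
    (and likewise for cos), so frequencies whose period is smaller than
    the frustum are smoothly damped toward zero instead of aliasing.
    """
    features = []
    for j in range(num_freqs):
        scale = 2.0 ** j
        damping = np.exp(-0.5 * scale ** 2 * var)
        features.append(np.sin(scale * mu) * damping)
        features.append(np.cos(scale * mu) * damping)
    return np.concatenate(features, axis=-1)

# Example: a point-like frustum keeps all frequencies, a fat one loses the high ones.
print(integrated_pos_enc(np.array([0.5]), np.array([1e-6]), 4))
print(integrated_pos_enc(np.array([0.5]), np.array([4.0]), 4))
```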
**Unprocessing**
The RawNeRF paper builds upon the work of mip-NeRF. In the RawNeRF paper, the denoising capabilities of the mip-NeRF model are improved by training the network on raw images instead of the post-processed images used by mip-NeRF. When an image is captured by a camera, for example a phone camera, various processing steps are applied before we can view it on our screens. Exactly which steps are applied is explained in the paper ‘Unprocessing Images for Learned Raw Denoising’ [3]. Figure 2 provides a visualization of these steps.

_Figure 2: The process of unprocessing PNG images to raw images_
RawNeRF outperforms the original mip-NeRF precisely on denoising: the paper shows that when raw images are used to train the network, and post-processing is applied to the network's output image, the resulting PSNR values are higher for RawNeRF than for mip-NeRF when noise is present in the raw images.
Raw images can be obtained in multiple ways. The authors of the RawNeRF paper capture images with an iPhone X and save them in DNG format, a dedicated format for storing raw images. They also render raw images from a Blender scene of a Lego model.
## Method
**Obtaining raw images**
Originally we wanted to reproduce Table 1 of RawNeRF [1], but the data used to generate those results was not publicly available. We therefore decided to do a reproducibility study on the synthetic Lego dataset, for which PNG renders of the Blender scene are available online[^1]. These images have to be converted to raw images using the code from the paper by Brooks et al. [3]. The RawNeRF paper mentions that the synthetic images are unprocessed using various parameters from the iPhone captures’ metadata. As this data was not publicly available, we used the default values found in the code of the unprocessing paper [3]. Not all unprocessing steps were applied: demosaicing was skipped, as the RawNeRF paper mentions that this is not done for synthetic renders. Figure 3 shows an example of an unprocessed image; a sketch of the applied steps follows the figure.

_Figure 3: Converting PNG images (left) to raw images (right)_
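Below is a minimal NumPy sketch of the unprocessing steps as we applied them, based on the code of [3]. The function names are ours, and `ccm`, `red_gain`, and `blue_gain` stand in for the defaults from the unprocessing code rather than the unavailable iPhone metadata:

```python
import numpy as np

def inverse_smoothstep(y):
    # Invert the smoothstep tone curve y = 3x^2 - 2x^3 on [0, 1].
    y = np.clip(y, 0.0, 1.0)
    return 0.5 - np.sin(np.arcsin(1.0 - 2.0 * y) / 3.0)

def gamma_expansion(y, eps=1e-8):
    # Approximate sRGB gamma decompression back to linear intensities.
    return np.maximum(y, eps) ** 2.2

def unprocess(srgb, ccm, red_gain, blue_gain):
    """sRGB image [H, W, 3] in [0, 1] -> synthetic raw. The image stays
    3-channel: mosaicing/demosaicing is skipped for synthetic renders [1]."""
    lin = gamma_expansion(inverse_smoothstep(srgb))
    cam = lin @ np.linalg.inv(ccm).T           # sRGB -> camera color space
    gains = np.array([1.0 / red_gain, 1.0, 1.0 / blue_gain])
    return np.clip(cam * gains, 0.0, 1.0)      # invert white-balance gains

# Example (illustrative parameters, not real camera metadata):
# raw = unprocess(png.astype(np.float32) / 255.0, np.eye(3), 2.0, 1.7)
```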
**Noisy data**
Deep learning methods have increasingly been applied to denoise raw images. Under the right circumstances, single-image denoisers are outperformed by their multi-image counterparts, and RawNeRF is described in [1] as competitive with the deep denoisers it is compared against. To synthetically add noise to our dataset, we used and tweaked the codebase of the unprocessing paper [3]. Once the noise looked right qualitatively, it was applied to create a noisy dataset (Figure 4); a sketch of the noise sampling follows the figure.

_Figure 4: Noisy input image_
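A minimal sketch of how the noise is sampled, following the heteroscedastic shot/read model from the codebase of [3]; the noise levels shown are illustrative, since we tuned ours qualitatively:

```python
import numpy as np

def add_noise(raw, shot_noise=0.01, read_noise=0.0005, rng=None):
    """Apply shot + read noise to a raw image in [0, 1]: Gaussian noise
    whose variance grows linearly with the pixel intensity."""
    rng = np.random.default_rng() if rng is None else rng
    variance = shot_noise * raw + read_noise
    return raw + rng.normal(scale=np.sqrt(variance), size=raw.shape)
```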
**Postprocessing**
To calculate PSNR values on the test images, we have to post-process the raw output images of the model back to LDR images. For this, the code from the unprocessing paper [3] was used. For each training image we saved the unprocessing parameters used to convert it to a raw image, and we applied the same parameters to revert those steps. Once the normal image is recovered, we can compute the PSNR between it and the test image; a sketch of this procedure follows Figure 5.

_Figure 5: Process of going from raw images to normal images_
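A minimal sketch of this reversal and of the PSNR computation; the function names are ours, and `ccm`, `red_gain`, and `blue_gain` are the per-image parameters saved during unprocessing:

```python
import numpy as np

def smoothstep(x):
    # Tone curve y = 3x^2 - 2x^3, the forward counterpart of inverse_smoothstep.
    x = np.clip(x, 0.0, 1.0)
    return 3.0 * x ** 2 - 2.0 * x ** 3

def postprocess(raw, ccm, red_gain, blue_gain):
    """Reverse the unprocessing steps: white balance, color correction,
    gamma compression, and the tone curve, yielding an LDR image in [0, 1]."""
    cam = raw * np.array([red_gain, 1.0, blue_gain])   # re-apply white balance
    srgb_lin = np.clip(cam @ ccm.T, 0.0, 1.0)          # camera -> sRGB
    return smoothstep(srgb_lin ** (1.0 / 2.2))         # gamma + tone curve

def psnr(pred, target):
    """PSNR in dB for images scaled to [0, 1]."""
    mse = np.mean((pred - target) ** 2)
    return -10.0 * np.log10(mse)
```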
## Experiments and results
We now present the results of three experiments. First, we show that training on a noisy dataset has little impact on the results compared to training on a noiseless one. Second, we show the positive impact of a larger training batch size on the results of RawNeRF. Third, we show the modest impact of the training set size on the results.
**Experiment 1: Noise**
To compare RawNeRF’s performance, the same training set is used twice: once without noise and once with synthetically added noise. The training set contains 120 images and the test set 80; the training set always consists of images captured from sufficiently different angles. In the original RawNeRF paper, noise was added by simulating different shutter speeds, resulting in different levels of noise, with an infinite shutter time yielding a “clean” image. In our experiment, noise was added by adapting the code from [3], as described above.
Overall, the evaluation images from the model trained on the noiseless dataset look slightly better, but not by a significant margin. In some areas, the images from the model trained on the noisy dataset even look sharper or show more contrast, so it is often hard to distinguish the two results. This small difference suggests that RawNeRF performs well as a denoiser.
**Experiment 2: Batch size**
For this experiment, two different models are trained with the noiseless dataset, once with a batch size of 1024 and once with a batch size of 16384. Again, the training set contains 120 images and the test set 80. An example comparison of the resulting renders is visible in Figure 6.

_Figure 6: Training with 16k batch size (left) vs. 1k batch size (right). More detail is visible in the left image (e.g. the caterpillar is much sharper)._
Qualitatively, the larger batch size increases the quality of the results by a significant amount: the result on the left, from the model trained with a batch size of 16384, has sharper edges and more detail.
**Experiment 3: Training set size**
In the final experiment we aim to determine the impact of the training set size on the results. For this, the noisy dataset is used with a training set of 60 images and a test set of 80. The training images are sampled such that they still cover as many varying angles as possible.

_Figure 7: Training set of size 60 (left) vs. training set of size 120 (right)._
Figure 7 shows that the larger training set slightly improves the results, but not by much. The slight improvement is visible in the red areas against the yellow background and in somewhat sharper edges with more detail.
The PSNR values obtained by evaluating each of the model versions can be seen in Table 2. Both the mean and the standard deviation of the PSNR values are very consistent across the different models in our results.
| Used in experiment(s) | Noise | Batch size | Train set size | Mean PSNR (dB) | Std. PSNR (dB) |
| --- | --- | --- | --- | --- | --- |
| 2 | No | 16384 | 120 | 23.70 | 1.10 |
| 1, 2 | No | 1024 | 120 | 22.85 | 1.08 |
| 1, 3 | Yes | 1024 | 120 | 22.98 | 1.12 |
| 3 | Yes | 1024 | 60 | 22.87 | 1.09 |
_Table 2: PSNR values (dB) from evaluating the models trained with the different configurations._
## Discussion and conclusion
The experiments show that RawNeRF indeed works well as a denoiser, since there is only a small difference between the results on the noisy and noiseless datasets. Furthermore, increasing the training batch size from 1024 to 16384 has a significant impact on the quality of the output images. Finally, training on 120 images yields slightly better output than training on 60, although the difference is small.
If we compare our results to those of RawNeRF [1], there is a difference in general quality. There are a few possible causes for this difference.
* Not all parameters used during training and unprocessing are specified, so we may have used different unprocessing parameters, which could cause the anomalies. An example of such a parameter we managed to find was _white_bkgd = False_, which significantly increased performance.
* Although the changes to the network architecture were implemented exactly as described in the paper, a mistake may still have been made.
* Because the camera settings for the Blender renders were not provided, tuning these settings was not possible. As a result, the images were not correctly undistorted, which is essential for accurate 3D reconstruction. We regard this as the most likely cause of the anomalies.
To conclude, the PSNR values say little about our results. In the original RawNeRF paper these values were useful for quantitatively evaluating the results, but for our results this is not the case. Noise has the least impact on the results, the training set size a little more, and the batch size the most.
It would be interesting for future work to further investigate why we could not fully reproduce the results, as well as why the PSNR values varied so little.
## Contributions
Rick Huizer: Mainly worked on converting the codebase of the mip-NeRF paper to RawNeRF by changing the activation and loss function. Also set up the entire project so that it can be trained locally on my PC, using CUDA and cuDNN. Helped with evaluating the trained models using the provided evaluation script, and fixed various bugs that prevented training from working properly.
Eduard Klein Onstenk: Mainly worked on validating the results and on writing the experiments-and-results section as well as the discussion and conclusion. The validation was done on TPUs from Google Cloud Platform.
Devin Lieuw A Soe: Updated the loss function and produced the raw clean and noisy input images in the first stages of the project. Helped with the cloud setup and with performing the experiments in the later stages. Created the poster used for the presentation and contributed to writing this blog post.
Thomas Markhorst: Worked on converting the codebase of the mip-NeRF paper to RawNeRF by changing the loss function and processing the raw output images back to PNG images. Divided the test and training sets and was responsible for training and evaluating the models, for which I deployed the project on Google Cloud Platform GPUs and later on TPUs, using the free TPUs for educational projects.
## References
[1] Ben Mildenhall, Peter Hedman, Ricardo Martin-Brualla, Pratul P. Srinivasan, and Jonathan T. Barron. NeRF in the Dark: High Dynamic Range View Synthesis from Noisy Raw Images. CVPR, 2022.
[2] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields. ICCV, 2021.
[3] Tim Brooks, Ben Mildenhall, Tianfan Xue, Jiawen Chen, Dillon Sharlet, and Jonathan T. Barron. Unprocessing Images for Learned Raw Denoising. CVPR, 2019.
## Supplementary material
| Model | Test index | Batch size | Training size | Noise | PSNR (dB) | SSIM |
| --- | --- | --- | --- | --- | --- | --- |
| `v2_nerf_raw_120_16384` | 64 | 16k | 120 | No | 23.70 ± 1.10 | 0.84 ± 0.020 |
| `v2_1_nerf_raw_120_1024` | 64 | 1k | 120 | No | 22.85 ± 1.08 | 0.82 ± 0.026 |
| `v2_nerf_raw_noise_120_1024` | 64 | 1k | 120 | Yes | 22.98 ± 1.12 | 0.82 ± 0.024 |
| `v2_1_nerf_raw_noise_60_1024` | 64 | 1k | 60 | Yes | 22.87 ± 1.09 | 0.82 ± 0.025 |
## Notes
[^1]:
[https://drive.google.com/file/d/1RjwxZCUoPlUgEWIUiuCmMmG0AhuV8A2Q/view](https://drive.google.com/file/d/1RjwxZCUoPlUgEWIUiuCmMmG0AhuV8A2Q/view)