:::info
# <center><i class="fa fa-edit"></i> Neural Radiance Fields </center>
:::
###### tags: `Neural` `MLP` `Radiance` `Fields`
# Introduction
After a couple of busy days trying to wrap my head around the concepts of volumetric rendering, I’m happy to continue with my journal, detailing my progress and the challenges I’ve faced. Among the literature shared by my colleague, the paper titled *3D Gaussian Splatting for Real-Time Radiance Field Rendering* caught my attention, and I decided to do a detailed review of it.
While reading the abstract, I quickly realized that Gaussian splatting is built upon Neural Radiance Fields (NeRFs) and photogrammetry. This prompted me to revisit the basics and refresh my knowledge of both photogrammetry and NeRFs.
In my next journal entry, I will dive deeper into photogrammetry—which will essentially be a review of material we've already covered. But for now, let’s take a look at what NeRFs are and why they are so important in 3D reconstruction.
## Getting Warm
On December 7th, 2024, the famous Notre Dame Cathedral reopened its doors, five years after the devastating fire that caused significant damage to the structure. The subsequent years were dedicated to its reconstruction and restoration, at a cost of approximately €700 million.
A crucial element in the restoration process was the point cloud data collected by Professor Tallon, an architectural historian who had gathered the data with the aim of understanding structural anomalies and the Gothic architecture of the cathedral. This dataset provided the only accurate as-built measurements of Notre Dame.
In short, the importance of accurate 3D reconstruction cannot be overstated.
## What are NeRFs?
NeRFs, or Neural Radiance Fields, are neural networks used to represent and render 3D scenes from 2D images. In simplified terms, NeRFs learn the behavior of light as it traverses through various scenes and objects by training on a set of 2D images captured from different viewpoints. Once the neural network is fully trained, it can generate novel views even from angles that were not part of the original input.
From a technical perspective, a NeRF can be understood as a function that takes a 5D input: {X, Y, Z} for the spatial coordinates and {θ, φ} for the viewing direction (pitch and yaw). A multilayer perceptron (MLP) processes this input and outputs a color (RGB) and a volume density σ, which acts like an opacity (the A in RGBA). These are the essential ingredients for volumetric rendering.
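In the notation of the original NeRF paper, the scene is represented by a learned function

$$
F_{\Theta}: (x, y, z, \theta, \phi) \longrightarrow (r, g, b, \sigma)
$$

where $(x, y, z)$ is a 3D point, $(\theta, \phi)$ is the viewing direction, $(r, g, b)$ is the emitted color, and $\sigma$ is the volume density at that point.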
What's particularly interesting is that NeRFs achieve photorealistic results in the rendering process. For example, think of how a silvery metallic object looks from a distance—it may appear uniformly grey. But when sunlight hits its surface directly, it can appear bright white in places. This variation in appearance based on lighting and viewpoint is exactly what NeRFs attempt to replicate, offering a more realistic and immersive visual experience.
The figure below gives a summary of the NeRF pipeline.

Figure 1: [Source](https://www.it-jim.com/blog/nerf-in-2023-theory-and-practice/)
## NeRF Pipeline
As I was learning about NeRFs and reading tons of tutorials to understand the concept, I quickly observed that the NeRF pipeline can be broken down into the following steps:
1. Generation of Rays:
Rays are cast from the camera center through every pixel of the image. Modeling how light travels along these rays through the scene is what allows us to render photorealistic views with a rich, immersive visual experience.
2. Sampling Rays at Specific Intervals:
Recall that light rays are continuous in nature. We cannot evaluate a continuous function everywhere, so each ray is sampled at discrete intervals (the ray equation after this list makes this concrete).
3. Positional Encoding:
The sampled points are passed through a positional encoding function, which returns encoded values. This step is important because neural networks suffer from spectral bias: they tend to learn low-frequency components of a function more easily than high-frequency ones. To address this, the input data is transformed using Fourier feature mapping, allowing the network to better learn high-frequency details, which makes positional encoding essential.
4. Feedforward Neural Network (Multi-Layer Perceptron):
The encoded values are then passed to a multi-layer perceptron (MLP), which consists of fully connected layers with ReLU (Rectified Linear Unit) activations that introduce non-linearity.
5. Ray Rendering:
The MLP outputs RGB color and density values, which are then used in the final step: volumetric rendering.
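To make steps 1 and 2 concrete, each pixel defines a ray

$$
\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}, \qquad t \in [t_{\text{near}},\ t_{\text{far}}]
$$

with origin $\mathbf{o}$ at the camera center and direction $\mathbf{d}$ through the pixel; the sampled 3D points are simply $\mathbf{r}(t_i)$ at a set of discrete depths $t_i$ between the near and far bounds.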
:-1:
I did a simple first implementation without training the neural network, and as you might expect, the results were not promising. In outline, the full approach boils down to four steps:
1) March camera rays through the scene to generate a sampled set of 3D points
2) Use those points and their corresponding 2D viewing directions as input to the neural network to produce an output set of colors and densities
3) Use classical volume rendering techniques to accumulate those colors and densities into a 2D image
4) Minimize the error between the rendered color and the ground-truth (GT) color (the loss is written out below)
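In my training loop, step 4 shows up as a mean squared error between the rendered image and the ground-truth image, which is the same objective as the per-ray loss used in the original paper, up to averaging:

$$
\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left\lVert \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2
$$

where $\mathcal{R}$ is the set of rays in a batch, $\hat{C}(\mathbf{r})$ is the rendered color of ray $\mathbf{r}$, and $C(\mathbf{r})$ is the ground-truth pixel color.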
## Step-Wise Implementation
### Step I: Ray Generation
The code below gives a sample implementation of ray generation:
```python=
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# Ray generation
# For volume rendering we need a ray origin and direction in world space for every pixel
def generate_rays(camera_matrix, img_height, img_width):
    # convert the matrix to a tensor
    camera_matrix = torch.tensor(camera_matrix, dtype=torch.float32)
    # pixel grid: i indexes rows (height), j indexes columns (width)
    i, j = torch.meshgrid(torch.arange(img_height, dtype=torch.float32),
                          torch.arange(img_width, dtype=torch.float32), indexing='ij')
    # per-pixel directions in camera space
    # note: the focal length and principal point are read out of the same matrix passed in;
    # when that matrix is a camera-to-world pose (as in the training loop), these entries are
    # not real intrinsics, since the dataset's true focal length lives in a separate `focal` array
    directions = torch.stack([(j - camera_matrix[0, 2]) / camera_matrix[0, 0],
                              (i - camera_matrix[1, 2]) / camera_matrix[1, 1],
                              torch.ones_like(i)], dim=-1)
    # rotate camera-space directions into world space
    directions = directions @ camera_matrix[:3, :3].T
    # all rays originate from the camera position (the translation column of the pose)
    origins = camera_matrix[:3, 3].expand(directions.shape)
    return origins, directions
```
This function generates ray origins and directions for each pixel in an image. It starts by creating a meshgrid that represents pixel coordinates. These coordinates are converted to 3D ray directions in camera space using the intrinsic parameters from the camera matrix. The directions are then transformed to world space using the camera’s rotation matrix. All rays originate from the camera’s position in world coordinates, allowing us to simulate how light travels through the scene for rendering or 3D reconstruction.
As good practice, it is always worth testing whether the function works as expected. The following check was run:
```python=
camera_matrix = poses[0]
a, b = images[0].shape[0:2]
a, b  # in a notebook this displays the image height and width
origins, directions = generate_rays(camera_matrix, a, b)
```
The function returned the two tensors. I was not sure at this point whether they were correct, but the subsequent steps would tell whether the results were any good.
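A quick sanity check worth doing here is to print the shapes: there should be one origin and one direction per pixel, so for the 100×100 images in the dataset used later both tensors should come out as `(100, 100, 3)`.

```python=
# sanity check: one ray origin and one ray direction per pixel
print(origins.shape, directions.shape)  # e.g. torch.Size([100, 100, 3]) twice for a 100x100 image
```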
### Step II: Sample Points
In this section the generated rays are sampled. As you can see below, the sampling function is straightforward: it takes the ray `origins` (the camera position), the ray `directions`, the number of samples per ray, and the `near` and `far` bounds that define how close and how far along each ray we sample.
```python=
# sample points along each ray
def sample_points(origins, directions, num_samples=64, near=2.0, far=6.0):
    # number of rays corresponds to the number of pixels
    N_rays = origins.shape[0]
    # generate sample distances along the ray, evenly spaced between near and far
    t_vals = torch.linspace(near, far, steps=num_samples)
    # broadcast the same set of distances to every ray
    t_vals = t_vals.expand(N_rays, num_samples)
    # compute 3D points along each ray (ray marching): p = o + t * d
    points = origins[:, None, :] + t_vals[..., None] * directions[:, None, :]
    return points, t_vals
```
The number of rays corresponds to the number of pixels (one ray per pixel). The values 2.0 and 6.0 for the near and far bounds are scene dependent: if the near bound is too close, we waste computation on empty space, and if the far bound is too far, we risk sampling irrelevant space. The function returned successfully.
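One refinement I have not used here, but which the original NeRF paper applies, is stratified sampling: each sample is jittered randomly within its depth interval so that, over training, the network sees a continuous range of depths instead of a fixed grid. A minimal sketch of how the distances could be jittered (the helper name is my own):

```python=
def stratified_t_vals(num_samples=64, near=2.0, far=6.0, n_rays=1):
    # evenly spaced depths, one row per ray
    t_vals = torch.linspace(near, far, steps=num_samples).expand(n_rays, num_samples)
    # interval boundaries around each depth
    mids = 0.5 * (t_vals[:, 1:] + t_vals[:, :-1])
    lower = torch.cat([t_vals[:, :1], mids], dim=-1)
    upper = torch.cat([mids, t_vals[:, -1:]], dim=-1)
    # draw one uniform sample inside each interval
    return lower + (upper - lower) * torch.rand_like(t_vals)
```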
### Step III: Positional Encoding
We now have the positions as well as the directions that we need to encode. The encoding function is shown below.
```python=
# positional encoding
def positional_encoding(x, num_frequencies=10):
    # map low-dimensional inputs to a higher-dimensional space of sin/cos features
    frequency_power = [2 ** i for i in range(num_frequencies)]
    encoding = []
    for freq in frequency_power:
        encoding.append(torch.sin(x * freq))
        encoding.append(torch.cos(x * freq))
    return torch.cat(encoding, dim=-1)

flattened_points = origins.view(-1, 3)
flattened_directions = directions.view(-1, 3)
# apply positional encoding (for this first pass I encode the ray origins;
# the training loop later encodes the sampled 3D points instead)
encoded_points = positional_encoding(flattened_points)
encoded_directions = positional_encoding(flattened_directions)
# concatenate the two encodings along the feature dimension
concatenated_encoding = torch.cat([encoded_points, encoded_directions], dim=-1)
```
Fourier feature mapping is performed to minimize spectral bias. We now have the encoded points and directions, which we can pass to our feed-forward neural network.
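For reference, the mapping defined in the original paper is

$$
\gamma(p) = \big(\sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\big)
$$

applied to each coordinate separately. My code above drops the $\pi$ factor, which is a common simplification, and uses $L = 10$ for both positions and directions, whereas the paper uses $L = 10$ for positions and $L = 4$ for directions.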
### Step IV: Feed Forward Neural Network
```python!
# MLP set-up
class NeRF(nn.Module):
    def __init__(self, input_dim, hidden_dim=256, output_dim=4):  # class constructor
        super(NeRF, self).__init__()
        # fully connected network: 4 output channels (RGB color + density)
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    # propagate the input through the network
    def forward(self, x):
        return self.network(x)

# create a class instance and run the encoded inputs through it
nerf_model = NeRF(input_dim=concatenated_encoding.shape[-1])
output = nerf_model(concatenated_encoding)
```
A simple feed-forward neural network is employed: there are no convolutions, just fully connected layers, with ReLU activations introducing non-linearity. The output has 4 features per sample, as expected.
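One thing this network does not do is constrain its outputs. The original paper applies a ReLU to the predicted density (so it cannot go negative) and a sigmoid to the colors (so they stay in [0, 1]); my implementation skips both, which is likely part of why the renders look off. A minimal sketch of that post-processing on the raw output:

```python=
# split the raw 4-channel output and constrain it as in the original paper
raw_rgb, raw_sigma = output[..., :3], output[..., 3]
rgb = torch.sigmoid(raw_rgb)   # colors constrained to [0, 1]
sigma = torch.relu(raw_sigma)  # density constrained to be non-negative
```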
### Step V: Rendering
```python!
def render_ray(ray_samples, densities, colors):
    # note: ray_samples (the t values along the ray) are accepted but not used,
    # so the sample-spacing term of the full rendering formula is omitted here
    # alpha (opacity) of each sample from its predicted density
    alphas = 1.0 - torch.exp(-densities)
    # Ensure alphas is at least 1D
    if alphas.dim() == 0:
        alphas = alphas.unsqueeze(0)
    # Transmittance: fraction of light that survives up to each sample
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1, device=alphas.device), 1.0 - alphas[:-1] + 1e-10]), dim=0
    )
    # Compute weights using alpha and transmittance
    weights = alphas * transmittance
    # Final RGB color = sum of weights * color samples
    rendered_color = torch.sum(weights[:, None] * colors, dim=0)
    return rendered_color
```
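For context, the quantity this function approximates is the discrete volume rendering sum from the original paper:

$$
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
\qquad
T_i = \exp\Big(-\sum_{j<i} \sigma_j \delta_j\Big)
$$

where $\sigma_i$ and $\mathbf{c}_i$ are the density and color predicted at sample $i$, and $\delta_i = t_{i+1} - t_i$ is the spacing between adjacent samples. My code above drops the $\delta_i$ term and applies no activation to the density, which is one of the things I want to revisit.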
### Step VI: Training Loop
```python=
# Initialize NeRF model and optimizer
nerf_model = NeRF(input_dim=120)  # 60 encoded point features + 60 encoded direction features
optimizer = torch.optim.Adam(nerf_model.parameters(), lr=5e-4)

focal = data['focal']
images = data['images']
poses = data['poses']

# Training loop (one optimization step per image)
for img_idx in range(len(images)):
    image = images[img_idx]
    pose = poses[img_idx]
    H, W = image.shape[:2]

    # Generate rays
    rays_o, rays_d = generate_rays(pose, img_height=H, img_width=W)
    rays_o = rays_o.view(-1, 3)
    rays_d = rays_d.view(-1, 3)

    # Sample points along rays
    points, t_vals = sample_points(rays_o, rays_d, num_samples=100, near=1.0, far=7.0)

    # Positional encoding of spatial points
    flattened_points = points.view(-1, 3)
    encoded_points = positional_encoding(flattened_points)

    # Repeat ray directions to match the sampled points
    directions = rays_d[:, None, :].expand(-1, 100, -1)
    flattened_directions = directions.reshape(-1, 3)
    encoded_dirs = positional_encoding(flattened_directions)

    # Concatenate position and direction encodings
    inputs = torch.cat([encoded_points, encoded_dirs], dim=-1)

    # Predict with the NeRF model
    outputs = nerf_model(inputs)
    outputs = outputs.view(rays_o.shape[0], 100, 4)
    colors = outputs[..., :3]
    density = outputs[..., 3]

    # Render rays one by one (slow, but simple)
    rendered_colors = []
    for i in range(rays_o.shape[0]):
        color = render_ray(t_vals[i], density[i], colors[i])
        rendered_colors.append(color)

    # Assemble the rendered image
    rendered_image = torch.stack(rendered_colors).view(H, W, 3)

    # Compute loss against the ground-truth image
    image_tensor = torch.tensor(image, dtype=torch.float32)
    loss = torch.nn.functional.mse_loss(rendered_image, image_tensor)

    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print(f"Image {img_idx}, Loss: {loss.item():.4f}")
```
### Step VII: Visualization
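I have not fleshed this step out yet. The minimal thing I plan to do here is compare the last rendered image against its ground-truth counterpart; a rough sketch of that comparison, assuming `rendered_image` and `image` are still in scope from the training loop:

```python=
# side-by-side comparison of the NeRF output and the ground-truth photo
plt.subplot(1, 2, 1)
plt.axis('off')
plt.title('Rendered')
plt.imshow(rendered_image.detach().clamp(0, 1).numpy())
plt.subplot(1, 2, 2)
plt.axis('off')
plt.title('Ground Truth')
plt.imshow(image)
plt.show()
```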
## Dataset
The dataset used in this experiment was obtained from Kaggle: [Tiny NeRF Dataset](https://www.kaggle.com/datasets/huabrother/tiny-nerf-data).
Reading the dataset:
```python!
# load the data
data = np.load('tiny_nerf_data.npz')
print(data.files)
if data:
    print('data loaded successfully')
else:
    raise ValueError('could not read tiny_nerf_data.npz')
# unpack the arrays
images = data['images']
poses = data['poses']
focal = data['focal']
print(f'image shape {images.shape} \t poses shape {poses.shape} \t focal shape {focal.shape}')
# show a few of the captured views
plt.subplot(2, 2, 1)
plt.axis('off')
plt.title('Captured Scene')
plt.imshow(images[2])
plt.subplot(2, 2, 2)
plt.axis('off')
plt.title('Captured Scene')
plt.imshow(images[4])
plt.subplot(2, 2, 3)
plt.axis('off')
plt.title('Captured Scene')
plt.imshow(images[52])
plt.subplot(2, 2, 4)
plt.axis('off')
plt.title('Captured Scene')
plt.imshow(images[34])
plt.show()
```

## Comments
Training a NeRF, especially for the first time, was quite engaging; it took two days just to get the model running. The training phase is computationally heavy and requires optimized hardware for faster performance. I now have a starting point, and this forms a stepping stone for me to experiment further.
## Results
The training process took quite long, slightly more than 3 hours. Based on the results, I could not see any convergence: the error was fairly low but did not improve across the dataset, which is something I have to look into. The figure below shows the actual training time on my CPU.

## Testing
I was curious to evaluate the quality of the NeRF implementation. To assess its performance, I generated a random camera pose, hoping to observe meaningful results. Ideally, the NeRF model should be capable of generating novel views. The testing-phase code is shared below:
```python=
# a hand-made camera pose to test novel view synthesis
my_generic_pose = np.array([
    [6.546e-01, 8.234e-01, -5.97e-01, -1.567e+00],
    [2.546e-01, 7.134e-01, -9.97e-01, -1.457e+00],
    [1.546e-01, 6.134e-01, -5.27e-01, -4.567e+00],
    [0.00,      0.00,       0.00,      1.000],
])
# image size for the rendered view
img_height = 100
img_width = 100
# generate rays for the scene
rays_o, rays_d = generate_rays(my_generic_pose, img_height=img_height, img_width=img_width)
# reshape to one ray per row
rays_d = rays_d.view(-1, 3)
rays_o = rays_o.view(-1, 3)
# sample points along each ray
points, t_vals = sample_points(rays_o, rays_d, num_samples=100, near=2.0, far=6.0)
# positional encoding of the sampled points
flattened_points = points.view(-1, 3)
encoded_points = positional_encoding(flattened_points)
# repeat ray directions to match the sampled points and encode them
directions = rays_d[:, None, :].expand(-1, 100, -1)
flattened_directions = directions.reshape(-1, 3)
encoded_directions = positional_encoding(flattened_directions)
# concatenate in the same order as during training: points first, then directions
nerf_input = torch.cat([encoded_points, encoded_directions], dim=-1)
# run the model without tracking gradients
with torch.no_grad():
    outputs = nerf_model(nerf_input)
    outputs = outputs.view(rays_o.shape[0], 100, 4)
    colors = outputs[..., :3]
    density = outputs[..., 3]
    rendered_colors = []
    for i in range(rays_o.shape[0]):
        color = render_ray(t_vals[i], density[i], colors[i])
        rendered_colors.append(color)
    rendered_image = torch.stack(rendered_colors).view(img_height, img_width, 3)
# visualize the novel view
img = rendered_image.clamp(0, 1).cpu().numpy()
plt.imshow(img)
plt.axis('off')
plt.show()
```
I was disappointed not to get any meaningful results, as seen below. In the coming days, I will review my implementation and hope to improve the results.
#### Rendered Scene

## Challenges
* Shape mismatches when working with tensors
* Training is computationally expensive for my PC