# NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
###### paper: [link](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123460392.pdf)
###### video: [link](https://www.youtube.com/watch?v=JuH79E8rdKc&ab_channel=MatthewTancik)
###### slide: `none`
## Abstract
1. Presents a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views
2. Uses a fully-connected (non-convolutional) deep network whose input is a 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and the view-dependent emitted radiance at that spatial location
## Introduction

1. Represents a static scene as a continuous 5D function that outputs the radiance emitted in each direction (θ, φ) at each point (x, y, z) in space
2. Also outputs a density at each point, which acts like a differential opacity controlling how much radiance is accumulated by a ray passing through (x, y, z)
3. Uses a deep fully-connected neural network (an MLP) without any convolutional layers
4. Addresses the low-resolution, blurry renderings that a basic MLP converges to by transforming the input 5D coordinates with a positional encoding that enables the MLP to represent higher-frequency functions
## Neural Radiance Field Scene Representation

1. They encourage the representation to be multiview consistent by restricting the network to predict the volume density σ as a function of only the location x, while allowing the RGB color c to be predicted as a function of both location and viewing direction
2. The MLP FΘ first processes the input 3D coordinate x with 8 fully-connected layers (using ReLU activations and 256 channels per layer) and outputs σ and a 256-dimensional feature vector
3. This feature vector is then concatenated with the camera ray's viewing direction and passed to one additional fully-connected layer (using a ReLU activation and 128 channels) that outputs the view-dependent RGB color
## Volume Rendering with Radiance Fields
1. Cast a camera ray r(t) = o + t·d from the camera center o along direction d, and consider the near and far bounds t_n and t_f of the rendering
2. The expected color C(r) seen by this ray can be expressed as C(r) = ∫_{t_n}^{t_f} T(t) σ(r(t)) c(r(t), d) dt
3. T(t) = exp(−∫_{t_n}^{t} σ(r(s)) ds) is the transmittance of the ray accumulated from the near bound t_n up to t
4. In other words, T(t) is the probability that the ray is not blocked by any particle between the near bound and t
5. This continuous integral is numerically estimated using quadrature
6. Partition the range from the near bound t_n to the far bound t_f into N evenly-spaced bins, draw one sample uniformly at random within each bin (stratified sampling), evaluate the contribution of each sample, and sum them up, as sketched below
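
A minimal sketch of this quadrature rule, assuming NumPy arrays `sigma`, `rgb`, and `t_vals` holding the densities, colors, and sample depths for a single ray (names chosen here for illustration). It implements the paper's estimate Ĉ(r) = Σᵢ Tᵢ (1 − exp(−σᵢ δᵢ)) cᵢ with Tᵢ = exp(−Σ_{j<i} σⱼ δⱼ) and δᵢ = t_{i+1} − t_i:

```python
import numpy as np

def render_ray(sigma, rgb, t_vals):
    """Estimate C(r) for one ray from N sampled densities/colors (alpha compositing)."""
    # delta_i = t_{i+1} - t_i; the last interval is treated as effectively infinite
    deltas = np.concatenate([t_vals[1:] - t_vals[:-1], [1e10]])
    # alpha_i = 1 - exp(-sigma_i * delta_i): probability the ray terminates in bin i
    alphas = 1.0 - np.exp(-sigma * deltas)
    # T_i = prod_{j<i} (1 - alpha_j) = exp(-sum_{j<i} sigma_j * delta_j): transmittance up to sample i
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                        # w_i = T_i * alpha_i
    return (weights[:, None] * rgb).sum(axis=0)     # expected pixel color
```

The same per-sample weights w_i are reused later as the sampling distribution for the hierarchical (coarse-to-fine) step.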
## Network Architecture

* The positional encoding of the input location (γ(x)) is passed through 8 fully-connected ReLU layers
* This feature vector is concatenated with the positional encoding of the input viewing direction (γ(d)), and is processed by an additional fully-connected ReLU layer with 128 channels
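
A minimal PyTorch sketch of the architecture described above, assuming the paper's encoding sizes (L = 10 frequencies for γ(x), giving 60 input channels, and L = 4 for γ(d), giving 24). The skip connection in the paper's full architecture diagram is omitted for brevity, so this is an illustration rather than the authors' exact model:

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Sketch: gamma(x) -> 8 x 256 ReLU layers -> (sigma, feature); feature + gamma(d) -> 128 ReLU -> rgb."""

    def __init__(self, x_dim=60, d_dim=24, width=256):
        super().__init__()
        layers = [nn.Linear(x_dim, width), nn.ReLU()]
        for _ in range(7):
            layers += [nn.Linear(width, width), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)                  # 8 fully-connected ReLU layers
        self.sigma_out = nn.Linear(width, 1)                 # view-independent volume density
        self.feature = nn.Linear(width, width)               # 256-dimensional feature vector
        self.rgb_hidden = nn.Sequential(nn.Linear(width + d_dim, 128), nn.ReLU())
        self.rgb_out = nn.Linear(128, 3)

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = torch.relu(self.sigma_out(h))                # density is non-negative
        h = self.rgb_hidden(torch.cat([self.feature(h), d_enc], dim=-1))
        rgb = torch.sigmoid(self.rgb_out(h))                 # color in [0, 1]
        return rgb, sigma
```

Given batches of encoded positions and directions, `NeRFMLP()(gamma_x, gamma_d)` returns the per-point color and density that the volume renderer consumes.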
## How NeRF works
1. NeRF's network encodes the scene itself: for each location X = (x, y, z) and viewing direction (θ, φ), it outputs the emitted color c = (r, g, b) and the volume density σ
2. To render an image from a given camera position and orientation, a ray is cast from the camera center through each pixel, and points are sampled along that ray
3. Each sampled point on the ray is fed into the NeRF network to obtain its emitted color and volume density, and the color of the pixel is then obtained by numerically integrating these values with the quadrature rule above (a sketch of the per-pixel ray generation follows this list)
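
A minimal sketch of the per-pixel ray generation, assuming a pinhole camera with focal length `focal` (in pixels) and a 4×4 camera-to-world pose `c2w`; the names and the "camera looks down −z" convention are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def get_rays(height, width, focal, c2w):
    """Return a ray origin and direction in world coordinates for every pixel."""
    i, j = np.meshgrid(np.arange(width, dtype=np.float32),
                       np.arange(height, dtype=np.float32), indexing="xy")
    # per-pixel directions in the camera frame (camera looks down -z)
    dirs = np.stack([(i - width * 0.5) / focal,
                     -(j - height * 0.5) / focal,
                     -np.ones_like(i)], axis=-1)
    rays_d = dirs @ c2w[:3, :3].T                        # rotate directions into the world frame
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)   # every ray starts at the camera center o
    return rays_o, rays_d
```

Points along each ray are then r(t) = rays_o + t · rays_d for the sampled depths t.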
## Optimizing a Neural Radiance Field
1. Positional Encoding
* Deep neural networks are biased toward learning low-frequency functions, so a plain MLP has trouble representing high-frequency detail
* The NeRF network therefore applies a positional encoding to its inputs: each input value p is mapped through sines and cosines at increasing frequencies, γ(p) = (sin(2^0 πp), cos(2^0 πp), …, sin(2^(L−1) πp), cos(2^(L−1) πp)), projecting the low-dimensional input into a higher-dimensional space (see the sketch after this list)
2. Hierarchical Volume Sampling
* The rendering strategy of densely evaluating the neural radiance field network at N query points along each camera ray is inefficient
* Free space and occluded regions that do not contribute to the rendered image are still sampled repeatedly
* **Two copies of the network are trained, a coarse one and a fine one. The ray is first sampled with the original stratified strategy and evaluated by the coarse network; the coarse network's outputs then indicate where along the ray samples actually matter, and a second, importance-sampled set of points is drawn there. Both sets of samples are passed to the fine network, whose prediction is used as the final output (see the sketch after this list)**
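
A minimal sketch of the positional encoding γ for PyTorch tensors; the function name and the assumption that inputs are already normalized are illustrative, not from the paper (which uses L = 10 for the location and L = 4 for the viewing direction):

```python
import math
import torch

def positional_encoding(p, num_freqs):
    """gamma(p): map each coordinate to sines and cosines at frequencies 2^0*pi ... 2^(L-1)*pi."""
    freqs = (2.0 ** torch.arange(num_freqs, dtype=torch.float32)) * math.pi
    angles = p[..., None] * freqs                                    # (..., dim, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., dim, 2L)
    return enc.flatten(start_dim=-2)                                 # (..., dim * 2L)
```

And a much-simplified sketch of the second-pass sampling: the coarse network's compositing weights are normalized into a piecewise-constant PDF along the ray, and new depths are drawn from it by inverse-transform sampling. The reference implementation also interpolates within bins; here the coarse bin centers are reused directly, and all names are illustrative:

```python
import torch

def sample_fine(bin_centers, weights, n_fine):
    """Draw extra sample depths where the coarse network placed high weight."""
    pdf = weights / (weights.sum() + 1e-8)               # normalize weights into a probability distribution
    cdf = torch.cumsum(pdf, dim=0)                        # piecewise-constant CDF along the ray
    u = torch.rand(n_fine)                                # uniform samples in [0, 1)
    idx = torch.searchsorted(cdf, u).clamp(max=len(cdf) - 1)
    return bin_centers[idx]                               # importance-sampled depths for the fine network
```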
## Results


## Supplementary material
* Code walkthrough
  * Ray generation
  * Sampling points for the integral
  * Computing the loss after passing the samples through the MLP network
  * Main program