# Kernel-Predicting Convolutional Networks for Denoising Monte Carlo Renderings
(Bako S, Vogels T et al. 2017)
## Introduction
- Photorealistic imagery is produced with physically-based Monte Carlo (MC) path tracing
- MC rendering
- Immense computational cost
- Long rendering time for noise-free images
- Many production renderers ship with integrated denoisers (Pixar's RenderMan, Corona Renderer, Chaos Group's V-Ray)
Contributions:
- First deep learning solution for denoising MC renderings which was trained and evaluated on actual production data
- A novel kernel-prediction CNN architecture that computes the locally optimal neighborhood weights
- Design:
- A two-network framework for denoising diffuse and specular components of the image separately
- A simple normalization procedure that significantly improves the approach (as well as previous methods) for images with high dynamic range
## Previous work
Image-space General Monte Carlo Denoising:
- Joint bilateral
- Like a Gaussian blur, but only neighbors with sufficiently similar values contribute
- "Joint" -> the similarity is also measured in auxiliary buffers (e.g. a second guidance image)
- Joint non-local means
- Averages all pixels in the image, weighted by how similar the patches around them are to the patch around the target pixel
- It was shown that joint filtering methods, such as those cited above, can be interpreted as linear regressions using a zero-order model
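The joint weighting idea can be sketched in a few lines of NumPy. This is an illustrative minimal version (the function name, buffer shapes, and sigma values are assumptions, not the exact parameterization of any cited method):

```python
import numpy as np

def joint_bilateral_weight(p, q, c, f, sigma_s=2.0, sigma_c=0.1, sigma_f=0.1):
    """Weight of neighbor pixel q with respect to center pixel p.

    c : (H, W, 3) noisy color buffer
    f : (H, W, D) auxiliary feature buffers (e.g. normals, albedo)
    The "joint" part is the feature term: neighbors that differ in the
    auxiliary buffers are down-weighted even if their colors agree.
    """
    d_s = sum((a - b) ** 2 for a, b in zip(p, q))   # squared spatial distance
    d_c = np.sum((c[p] - c[q]) ** 2)                # color dissimilarity
    d_f = np.sum((f[p] - f[q]) ** 2)                # feature dissimilarity
    return np.exp(-d_s / (2 * sigma_s ** 2)
                  - d_c / (2 * sigma_c ** 2)
                  - d_f / (2 * sigma_f ** 2))
```

The denoised pixel is then the weighted average of its neighbors' colors, with the weights normalized to sum to one.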
Neural Networks:
- Using neural networks for denoising (e.g. recurrent denoising autoencoder by Chaitanya et al., 2017)
Problems:
- Training a network to compute a denoised color from only a raw, noisy color buffer causes overblurring, since the network cannot distinguish scene noise from scene detail
- Since rendered images have high dynamic range, training directly on them can destabilize the weights
## Theoretical background
Per-pixel data:
$$
\mathbf{x}_{p}=\{\mathbf{c}_{p}, \mathbf{f}_{p}\}
$$
where $\mathbf{c}_{p}$ represents the RGB color channels and $\mathbf{f}_{p}$ is a set of $D$ auxiliary features.
The ideal denoising parameters at every pixel can be written as:
$$
\widehat{\boldsymbol{\theta}}_{p}=\underset{\boldsymbol{\theta}}{\operatorname{argmin}} \ell\left(\overline{\mathbf{c}}_{p}, g\left(\mathbf{X}_{p} ; \boldsymbol{\theta}\right)\right)
$$
where $\overline{\mathbf{c}}_{p}$ is the ground truth result, $\mathbf{X}_{p}$ is a block of per-pixel vectors around the neighborhood of pixel $p$, and $\widehat{\mathbf{c}}_{p} = g\left(\mathbf{X}_{p} ; \boldsymbol{\theta}\right)$ is the denoised value.
Ground truth values are not available at run time, so a weighted least-squares regression on the color values around the pixel's neighborhood is applied:
$$
\widehat{\boldsymbol{\theta}}_{p}=\underset{\boldsymbol{\theta}}{\operatorname{argmin}} \sum_{q \in \mathcal{N}(p)}\left(\mathbf{c}_{q}-\boldsymbol{\theta}^{\top} \phi\left(\mathbf{x}_{q}\right)\right)^{2} \omega\left(\mathbf{x}_{p}, \mathbf{x}_{q}\right)
$$
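This regression can be sketched in NumPy. The function name and the choice of a first-order model $\phi(\mathbf{x}_q) = [1, \mathbf{f}_q]$ are illustrative assumptions:

```python
import numpy as np

def wls_denoise(c_nbhd, f_nbhd, w):
    """Solve argmin_theta sum_q w_q * (c_q - theta^T phi(x_q))^2 over one neighborhood.

    c_nbhd : (n, 3) noisy RGB colors of the n neighborhood pixels
    f_nbhd : (n, D) auxiliary features; row 0 is the center pixel p
    w      : (n,)  regression weights omega(x_p, x_q)
    """
    phi = np.hstack([np.ones((len(c_nbhd), 1)), f_nbhd])  # first-order model
    sw = np.sqrt(w)[:, None]                              # fold weights into lstsq
    theta, *_ = np.linalg.lstsq(sw * phi, sw * c_nbhd, rcond=None)
    return phi[0] @ theta                                 # denoised color at p
```

A zero-order model corresponds to $\phi(\mathbf{x}_q) = 1$, which recovers the weighted-average form of the joint filters above.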
Supervised learning approach:
$$
\widehat{\boldsymbol{\theta}}=\underset{\boldsymbol{\theta}}{\operatorname{argmin}} \frac{1}{N} \sum_{i=1}^{N} \ell\left(\overline{\mathbf{c}}_{i}, g\left(\mathbf{X}_{i} ; \boldsymbol{\theta}\right)\right)
$$
Three issues:
- $g$ must be flexible enough (choice: deep convolutional network)
- $\ell$ (choice: absolute-value loss function)
- must capture perceptually important differences between the estimated and reference color
- must be easy to evaluate and optimize
- large training dataset $\mathcal{D}$
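The supervised objective with the absolute-value loss is simple to state in code. A minimal sketch, assuming plain NumPy arrays for the reference and predicted patches:

```python
import numpy as np

def empirical_risk(refs, preds):
    """(1/N) * sum_i l(c_ref_i, c_pred_i), with l the absolute-value (l1) loss
    averaged over pixels and channels of each patch."""
    return np.mean([np.mean(np.abs(r - p)) for r, p in zip(refs, preds)])
```

The $\ell_1$ loss is easy to evaluate and optimize, and it is less sensitive to the occasional extreme pixel values of HDR renderings than a squared loss.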
## Deep Convolutional Denoising
Since each layer of a CNN applies multiple spatial kernels with learnable weights that are shared over the entire image space, they are naturally suited for the denoising task and have indeed been previously used for traditional image denoising. Furthermore, by joining many such layers together with activation functions, CNNs are able to learn highly nonlinear functions of the input features, which are important for obtaining high-quality outputs.
### Network Architecture

### Reconstruction Methods
| Direct-prediction conv network (DPCN) | Kernel-prediction conv network (KPCN) |
| ----------------- |:----------------------- |
| The CNN directly predicts the final denoised pixel value as a highly non-linear combination of the input features. | Instead of directly outputting a denoised pixel, the final layer of the network outputs a kernel of scalar weights that is applied to the noisy neighborhood of the pixel. |
|Slower convergence|Faster convergence (5-6x faster)|
#### DPCN
- The unconstrained nature and complexity of the problem makes optimization difficult. The magnitude and variance of the stochastic gradients computed during training can be large, which slows convergence.
#### KPCN
- The kernel size is specified before training along with the other network hyperparameters and the same weights are applied to each RGB color channel.
Normalized kernel weights:
$$
w_{pq}=\frac{\exp \left(\left[\mathbf{z}_{p}^{L}\right]_{q}\right)}{\sum_{q^{\prime} \in \mathcal{N}(p)} \exp \left(\left[\mathbf{z}_{p}^{L}\right]_{q^{\prime}}\right)}
$$
The denoised pixel color:
$$
\widehat{\mathbf{c}}_{p}=g_{\text {weighted }}\left(\mathbf{X}_{p} ; \boldsymbol{\theta}\right)=\sum_{q \in \mathcal{N}(p)} \mathbf{c}_{q} w_{p q}
$$
- The kernel weights can be interpreted as including a softmax activation function on the network outputs in the final layer over the entire neighborhood.
- 3 benefits:
- It ensures that the final color estimate always lies within the convex hull of the respective neighborhood of the input image. This vastly reduces the search space of output values as compared to the direct-prediction method and avoids potential artifacts (e.g. color shifts).
- It ensures the gradients of the error with respect to the kernel weights are well behaved, which prevents large oscillatory changes to the network parameters caused by the high dynamic range of the input. Intuitively, the weights need to only encode the relative importance of the neighborhood; the network does not need to learn the absolute scale. In general, scale-reparameterization schemes have recently proven to be crucial for obtaining low-variance gradients and speeding up convergence.
- It could potentially be used for denoising across layers of a given frame, a common case in production, by applying the same reconstruction weights to each component.
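The two reconstruction equations can be sketched directly (illustrative names and shapes; a real KPCN emits one $k \times k$ kernel per pixel from its final convolutional layer):

```python
import numpy as np

def kpcn_reconstruct(z, c_nbhd):
    """Softmax-normalize the raw scores and blend the noisy neighborhood.

    z      : (k*k,)   raw network outputs [z_p^L]_q for pixel p
    c_nbhd : (k*k, 3) noisy RGB colors in p's k x k neighborhood
    """
    w = np.exp(z - z.max())  # subtract max for numerical stability
    w /= w.sum()             # kernel weights w_pq sum to 1
    return w @ c_nbhd        # c_hat_p = sum_q w_pq * c_q
```

Because the weights are non-negative and sum to one, the output is a convex combination of the neighborhood colors, which is exactly the convex-hull property listed above.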
### Diffuse/Specular Decomposition
- The various components of the image have different noise characteristics and spatial structure, which often leads to conflicting denoising constraints.
- Solution: decomposing the image into diffuse and specular components.
#### Diffuse-component Preprocessing
- The diffuse color (the outgoing radiance due to diffuse reflection) is well behaved and typically has a small range. Thus, training the diffuse CNN is stable, and the resulting network performs well without color preprocessing.
#### Specular-component Preprocessing
- Denoising the specular color is challenging due to the high dynamic range of specular and glossy reflections.
- Solution: a log transform applied to the specular color before denoising, and inverted afterward
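A common form of such a transform, assumed here to be $\log(1 + c)$ with its inverse applied after denoising:

```python
import numpy as np

def specular_to_log(c):
    """Compress the HDR specular buffer before feeding it to the network."""
    return np.log1p(c)  # log(1 + c); well defined at c = 0

def specular_from_log(c_log):
    """Invert the transform after denoising to restore radiance values."""
    return np.expm1(c_log)
```

This compresses radiance values spanning several orders of magnitude into a small range, which keeps training stable.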
## Experimental Setup
### Data
Training set:
- 600 representative frames sampled from the entire movie Finding Dory generated using RenderMan’s path-tracer
- Reference images: 1024 spp (samples per pixel)
- Inputs: 32 spp / 128 spp
- 65 x 65 patches
Test set:
- 25 diverse frames from the films Cars 3 and Coco, containing effects such as motion blur, depth of field, glossy reflections, and global illumination
## Results
- Overall, KPCN performs as well or better than state-of-the-art techniques both perceptually and quantitatively.
## Analysis
Design choices
- $\ell_{1}$ loss: it gave the lowest error in experiments
- DPCN / KPCN convergence speed: KPCN converges roughly 5-6x faster than DPCN