# TMLR response
## Response to R1
Thank you for your constructive comments! We have revised our manuscript according to the requested changes. Major changes are highlighted in blue.
**Improving clarity of formulas**.
<!-- 1. In Eq. (2), i.e., $\{{\rho}(\mathbf{x}, \boldsymbol{\omega}_\text{light}, \boldsymbol{\omega}_\text{out}), \sigma(\mathbf{x})\} = \text{NN}(\textbf{x}, \boldsymbol{\omega}_\text{light}, \boldsymbol{\omega}_\text{out})$, we used {} to indicate that both our transfer function values $\boldsymbol\rho$ and volumetric density $\sigma$ are the outputs of a deep network $\text{NN}$. We replace all such notations (in Eq. (2), Eq. (11), and inline text) with $\text{NN}_\Theta: (\mathbf{x}, \boldsymbol{\omega}_\text{light}, \boldsymbol{\omega}_\text{out})\rightarrow({\rho}, \sigma)$, where $\Theta$ denotes the network parameters. -->
1. In Eq. (2), we previously used braces, i.e., $\{\rho, \sigma\} = \text{NN}(\cdot)$, to indicate that both the transfer function value $\rho$ and the volumetric density $\sigma$ are outputs of a deep network. We have replaced all such notations (in Eq. (2), Eq. (11), and inline text) with $\text{NN}_\Theta: (\mathbf{x}, \boldsymbol{\omega}_\text{light}, \boldsymbol{\omega}_\text{out})\rightarrow({\rho}, \sigma)$, where $\Theta$ denotes the network parameters.
2. In Eq. (1), we have replaced $L(\boldsymbol r)$ with $L(\mathbf o, \boldsymbol\omega_\text{out})$, which denotes the radiance arriving at the camera location $\mathbf o$ from direction $\boldsymbol\omega_\text{out}$. We have made this notation consistent throughout the text.
3. In Eq. (8), we have replaced $\mathcal{L}$ with $\mathcal{L}(\Theta)$ to indicate the parameters $\Theta$ to optimize.
**More evaluation on a benchmark**.
We have added experiments on $5$ real opaque objects from the Diligent-MV dataset [Li et al., 2020]. This dataset provides multi-view photometric images of $5$ objects with different levels of shininess. For each object, there are $20$ views forming a circle around the object. For each view, there are $96$ calibrated light sources that are spatially fixed relative to the camera. We use a front view and a back view (view $10$ and view $20$) as the test set, and the other $8$ views as the training set. Note that the light directions in the training set are completely disjoint from those in the test set, since the light sources move with the camera. We show qualitative results in Figure 5 and quantitative metrics in Section 4.2, and attach the numbers below (we additionally compare to a recent method, IRON [Zhang et al., 2022], and drop NeRD because it has convergence issues on this dataset):
| Diligent-MV dataset | PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ |
|:----------|:--------:|:--------:|:--------:|
| o-NeRF | 31.36 | 0.944 | 0.038 |
| PhySG | 21.59 | 0.942 | 0.061 |
| IRON | 14.70 | 0.895 | 0.082 |
| OSF (ours) | **39.07** | **0.970** | **0.020** |
Our models synthesize reasonably faithful images of these real opaque objects, which feature complex effects including self-shadows (e.g., the bear and the buddha) and shiny reflections (e.g., the buddha's belly and the cow). Quantitatively, our models outperform the compared methods significantly. From Figure 5 we observe that IRON produces overly bright renderings, possibly because it explains bright regions (e.g., highlights caused by non-collocated lighting) with a very high albedo. As a result, its PSNR is very low.
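To illustrate why overly bright renderings depress PSNR so strongly, the following is a minimal sketch on synthetic data. It is not the evaluation code behind the table: the `psnr` helper, the random "ground-truth" image, the noise level, and the $1.5\times$ brightness factor are all assumptions chosen only for illustration.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

rng = np.random.default_rng(0)
# Hypothetical ground-truth image with moderate intensities.
target = rng.uniform(0.2, 0.6, size=(64, 64, 3))

# A rendering with a small, unbiased per-pixel error.
noisy = np.clip(target + rng.normal(0.0, 0.01, target.shape), 0.0, 1.0)
# A rendering that is globally too bright (e.g., an overestimated albedo).
too_bright = np.clip(target * 1.5, 0.0, 1.0)

psnr_noisy = psnr(noisy, target)
psnr_bright = psnr(too_bright, target)
print(f"small noise: {psnr_noisy:.1f} dB, over-bright: {psnr_bright:.1f} dB")
```

Because PSNR is driven by the mean squared error, a systematic brightness bias is penalized at every pixel, whereas structure-oriented metrics such as SSIM are affected less. This is consistent with IRON's low PSNR but comparatively moderate SSIM in the table.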
**Why can NeRF not fit details?**
NeRF assumes a static scene without any changes in lighting conditions. In the datasets we use, the images of each object can have different lighting conditions, and thus NeRF fails to fit them well. We have added this clarification to Section 4.2.
**Discussion on limitations**.
The main technical idea is to approximate complex light transport by learning a radiance transfer function, which relies on the assumption of unobstructed distant lights (Eq. (6)). A major limitation of this approximation is that when the object is partially occluded from the light source, the estimate of the outgoing radiance is biased. Further discussion can be found in Section 3.5. In addition, while the distant light assumption can be reasonably approximated for small real objects by holding the light source far away from the object (as we show in experiments), this is difficult to do for big objects. We have also added this discussion to Section 5.
***References***
[Li et al., 2020] Min Li, et al. "Multi-view photometric stereo: A robust solution and benchmark dataset for spatially varying isotropic materials." IEEE Transactions on Image Processing, 2020.
[Zhang et al., 2022] Kai Zhang, et al. "IRON: Inverse rendering by optimizing neural SDFs and materials from photometric images." In CVPR, 2022.
## Response to R2
Thank you for your constructive comments! We have revised our manuscript according to the requested changes. Major changes are highlighted in blue.
**Evaluation on additional real objects**.
We have added experiments on $5$ real opaque objects from the Diligent-MV dataset [Li et al., 2020]. This dataset provides multi-view photometric images of $5$ objects with different levels of shininess. For each object, there are $20$ views forming a circle around the object. For each view, there are $96$ calibrated light sources that are spatially fixed relative to the camera. We use a front view and a back view (view $10$ and view $20$) as the test set, and the other $8$ views as the training set. Note that the light directions in the training set are completely disjoint from those in the test set, since the light sources move with the camera. We show qualitative results in Figure 5 and quantitative metrics in Section 4.2, and attach the numbers below (we additionally compare to a recent method, IRON [Zhang et al., 2022], and drop NeRD because it has convergence issues on this dataset):
| Diligent-MV dataset | PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ |
|:----------|:--------:|:--------:|:--------:|
| o-NeRF | 31.36 | 0.944 | 0.038 |
| PhySG | 21.59 | 0.942 | 0.061 |
| IRON | 14.70 | 0.895 | 0.082 |
| OSF (ours) | **39.07** | **0.970** | **0.020** |
Our models synthesize reasonably faithful images of these real opaque objects, which feature complex effects including self-shadows (e.g., the bear and the buddha) and shiny reflections (e.g., the buddha's belly and the cow). Quantitatively, our models outperform the compared methods significantly. From Figure 5 we observe that IRON produces overly bright renderings, possibly because it explains bright regions (e.g., highlights caused by non-collocated lighting) with a very high albedo. As a result, its PSNR is very low.
**Explanation of synthetic data generation**.
1. *Choosing objects*. We consider objects that have interesting shapes (e.g., all objects selected from the ObjectFolder dataset feature highly non-convex shapes, such as the airplane and the cup) or are standard evaluation shapes (e.g., the Stanford Bunny), and that have different textures (e.g., the bowl and kart have sharp texture edges, while the bunny and cup have smooth textures).
2. *Setting up cameras and lighting*. For a comprehensive evaluation over different viewing angles and lighting angles (which is difficult for real objects due to physical limitations during capture), we sample cameras uniformly on an upper hemisphere and light directions uniformly over solid angles, for both the training set and the test set. We have added this clarification to Section 4.1.
3. *Translucent materials*. We generate synthetic images in Blender 3.0, using the Cycles path tracer. We use the Principled BSDF [Burley, 2012] with the Christensen-Burley approximation to physically based subsurface volume scattering [Christensen, 2015]. We have added this clarification to Section 4.1. Please see below for additional experiments on different levels of subsurface scattering using this model.
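As an illustration of the sampling described in item 2, the following is a minimal sketch of drawing directions uniformly with respect to solid angle. It is a sketch under stated assumptions rather than the script used to generate the data: the function names, the random seed, the sample counts, the choice of $+z$ as the up axis, and the choice to sample light directions over the full sphere are all hypothetical.

```python
import numpy as np

def sample_sphere(n, rng):
    """Draw n unit vectors uniformly over the sphere (uniform in solid angle)."""
    v = rng.normal(size=(n, 3))  # an isotropic Gaussian has no preferred direction
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def sample_upper_hemisphere(n, rng):
    """Draw n unit vectors uniformly over the upper hemisphere (z >= 0)."""
    d = sample_sphere(n, rng)
    d[:, 2] = np.abs(d[:, 2])  # reflect the lower half onto the upper half
    return d

rng = np.random.default_rng(0)
# Hypothetical directions from the object center towards each camera.
camera_dirs = sample_upper_hemisphere(500, rng)
# Hypothetical directions of the distant light (assumed here to cover the full sphere).
light_dirs = sample_sphere(500, rng)
```

Normalizing Gaussian samples avoids the clustering near the poles that results from sampling the polar and azimuthal angles independently and uniformly.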
**How well can we model different levels of translucency?**
We have added experiments on a Stanford Bunny with different levels of translucency. We set the subsurface scattering intensity to {0, 0.1, 0.3, 0.6, 1} for five otherwise identical objects (a comparison is shown in Figure S1 in the supplement). For each object, we generate 500 images for training and 500 images for testing. The cameras and light directions are the same across all objects. The results are shown below.
|Subsurface scattering intensity| PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ |
|:-----------|:---------------:|:---------------:|:------------------:|
| 0 | 30.01 | 0.89 | 0.067 |
| 0.1 | 34.55 | 0.94 | 0.030 |
| 0.3 | 37.46 | 0.97 | 0.011 |
| 0.6 | 39.63 | 0.98 | 0.006 |
| 1 | 40.91 | 0.99 | 0.004 |
As shown in the table (please also refer to Figure S1), our OSFs can model translucent objects very well. The quantitative results improve with higher degrees of translucency. A major reason is that the appearance of translucent objects varies smoothly with changing view angles and lighting directions, and thus a learned neural implicit model is well suited to represent it. In contrast, opaque concave objects can have harsh self-shadows, which are very high-frequency signals that are difficult for neural models to represent and interpolate [Rahaman, 2019].
***References***
[Li et al., 2020] Min Li, et al. "Multi-view photometric stereo: A robust solution and benchmark dataset for spatially varying isotropic materials." IEEE Transactions on Image Processing, 2020.
[Zhang et al., 2022] Kai Zhang, et al. "IRON: Inverse rendering by optimizing neural SDFs and materials from photometric images." In CVPR, 2022.
[Burley, 2012] Brent Burley. "Physically-based shading at Disney." In ACM SIGGRAPH, 2012.
[Christensen, 2015] PH Christensen. "An approximate reflectance profile for efficient subsurface scattering." In ACM SIGGRAPH 2015 Talks, 2015.
[Rahaman, 2019] Nasim Rahaman, et al. "On the spectral bias of neural networks." In ICML, 2019.
## Response to R3
Thank you for your constructive comments! We have revised our manuscript according to the requested changes. Major changes are highlighted in blue.
**Using non-compressed renderings in figures**.
Thanks. We have updated all our figures with non-compressed versions.
**Discussion on IRON**.
We note that IRON is not well suited to the datasets we consider, such as the Diligent-MV photometric dataset, because it assumes that the light and the camera are collocated for every image and that little self-shadowing appears in the training images. In addition, IRON does not model subsurface scattering for translucent objects. Nevertheless, we have added comparisons to IRON in our free-viewpoint relighting experiments on the Diligent-MV dataset (Table 3 and Figure 5). From Figure 5 we observe that IRON produces overly bright renderings, possibly because it explains bright regions (e.g., highlights caused by non-collocated lighting) with a very high albedo. As a result, its PSNR in Table 3 is very low.
**Discussion on Zheng et al. (2021)**.
While Zheng et al. (2021) demonstrate similar tasks, our KiloOSF renders much faster. Zheng et al. (2021) use specific designs such as spherical harmonics for scattering and visibility modeling. In contrast, our model is simple and can adopt recent advances in accelerating neural rendering. For comparison, Zheng et al. (2021) report $7.9$ s for rendering a single $400\times400$ image ($0.1$ FPS) with $64$ samples per primary ray (their paper does not provide source code, so we follow their reported setting as closely as we can). KiloOSF needs only $97$ ms ($10.3$ FPS) for the same resolution and sample count, which provides an interactive frame rate. We have added this discussion to the related work.
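The frame rates quoted above follow directly from the per-frame latencies. A small worked check, with variable names introduced only for this example:

```python
zheng_latency_s = 7.9      # seconds per 400x400 frame, as reported by Zheng et al. (2021)
kiloosf_latency_s = 0.097  # 97 ms per frame for KiloOSF

zheng_fps = 1.0 / zheng_latency_s              # about 0.13, i.e. roughly 0.1 FPS
kiloosf_fps = 1.0 / kiloosf_latency_s          # about 10.3 FPS
speedup = zheng_latency_s / kiloosf_latency_s  # ratio of per-frame times

print(f"{zheng_fps:.2f} FPS vs {kiloosf_fps:.1f} FPS, speedup {speedup:.0f}x")
# prints "0.13 FPS vs 10.3 FPS, speedup 81x"
```

In other words, under these numbers KiloOSF renders a frame roughly 80 times faster at the same resolution and sample count.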
**Evaluation on Diligent-MV**.
We have added experiments on $5$ real opaque objects from the Diligent-MV dataset [Li et al., 2020]. This dataset provides multi-view photometric images of $5$ objects with different levels of shininess. For each object, there are $20$ views forming a circle around the object. For each view, there are $96$ calibrated light sources that are spatially fixed relative to the camera. We use a front view and a back view (view $10$ and view $20$) as the test set, and the other $8$ views as the training set. Note that the light directions in the training set are completely disjoint from those in the test set, since the light sources move with the camera. We show qualitative results in Figure 5 and quantitative metrics in Section 4.2, and attach the numbers below (we additionally compare to a recent method, IRON [Zhang et al., 2022], and drop NeRD because it has convergence issues on this dataset):
| Diligent-MV dataset | PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ |
|:----------|:--------:|:--------:|:--------:|
| o-NeRF | 31.36 | 0.944 | 0.038 |
| PhySG | 21.59 | 0.942 | 0.061 |
| IRON | 14.70 | 0.895 | 0.082 |
| OSF (ours) | **39.07** | **0.970** | **0.020** |
Our models synthesize reasonably faithful images on these real opaque objects, which feature complex effects including self-shadows (e.g., the bear and the buddha) and shiny reflections (e.g., the buddha's belly and the cow). Quantitatively, our models outperform the compared methods significantly.
**Compositions with indirect illumination**.
Following your suggestion, we have incorporated a simplified Cornell-box-like scene. To analyze light transport effects including shadows and indirect lighting, we showcase a simple scene consisting of a grey cube and a red floor in Figure 9, where we show our scene composition together with variants that remove shadows and indirect lighting. From Figure 9, we observe increased realism when shadows and indirect lighting are added. Specifically, indirect lighting produces a color bleeding effect from the red floor onto the cube. With these light transport effects, our method correctly synthesizes the scene with up to two light bounces (i.e., direct lighting and one-bounce indirect lighting).
While here we only showcase a simplistic scene and compute only one bounce for indirect lighting due to computational constraints, we argue that this is a meaningful step towards more realistic compositional neural rendering, and our method may benefit from future advances in efficient neural rendering to scale to complex scenes and more light bounces.
**Table 5**.
We have fixed Table 5.
**How synthetic objects are chosen**.
We consider objects that have interesting shapes (e.g., all objects selected from the ObjectFolder dataset feature highly non-convex shapes such as the airplane and the cup) or standard evaluation shape (e.g., the Stanford Bunny), and different textures (e.g., the bowl and kart have sharp texture edges, while the bunny and cup have smooth textures).
**Evaluate on more objects**.
Thanks for raising this point. As mentioned above, we have included 5 additional real objects from the Diligent-MV dataset. In addition, we have included an evaluation on 5 synthetic Stanford Bunny models in Section C.1 of the supplement, focusing on how well OSF can model different levels of translucency.
**Clarify image segmentation**.
The clarification was included in Section 4.1 of the original submission (Real image capture setup).
**Merging Sections 3.5 and 3.6**.
We clarify that Section 3.6 ("Accelerating OSF Rendering") included two parts: the KiloOSF extension, which accelerates inference for any OSF model, and the exclusion of non-intersecting rays in scene composition. The former applies to both single objects and scene composition. Thus, following your suggestion, we have merged the latter into the scene composition subsection and made the KiloOSF part a separate subsection entitled "KiloOSF for accelerating rendering".
**Spatial arrangement of Tables 3-5**.
Thanks. We have followed your suggestion to rearrange them.
**Viewpoints in real data split**.
For each light direction, we hold our camera and randomly walk around the scene to take photos, so that every photo has a different camera pose. The average distance between each test camera center and its nearest training camera center is $1.3$ cm. For reference, after rescaling NeRF's bulldozer scene so that its diameter matches that of our soaps ($10$ cm), the corresponding average distance is $1.4$ cm, which is very close to ours.
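The statistic above can be computed as in the following sketch, which assumes that camera centers are available as $N\times3$ arrays in centimeters. The function name and the toy coordinates are hypothetical and serve only to illustrate the computation.

```python
import numpy as np

def mean_nearest_train_distance(test_centers, train_centers):
    """Average, over test cameras, of the distance to the closest training camera."""
    # Pairwise offsets between every test and every training camera: shape (T, N, 3).
    diff = test_centers[:, None, :] - train_centers[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)   # pairwise distances, shape (T, N)
    return float(dists.min(axis=1).mean())  # nearest neighbor per test camera, then average

# Toy camera centers in cm (made up for illustration).
train = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
test = np.array([[1.0, 0.0, 0.0], [10.0, 2.0, 0.0]])
print(mean_nearest_train_distance(test, train))  # (1 + 2) / 2 = 1.5
```

A small value relative to the object diameter indicates that test views lie close to training views, so the measure is only comparable across datasets after the scenes are brought to a common scale, as done for the bulldozer scene above.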
***References***
[Li et al., 2020] Min Li, et al. "Multi-view photometric stereo: A robust solution and benchmark dataset for spatially varying isotropic materials." IEEE Transactions on Image Processing, 2020.
[Zhang et al., 2022] Kai Zhang, et al. "IRON: Inverse rendering by optimizing neural SDFs and materials from photometric images." In CVPR, 2022.
## Response to R4
Thank you for your constructive comments! We have revised our manuscript according to the requested changes. Major changes are highlighted in blue.
**Explain how to use the baseline o-NeRF in relighting**.
o-NeRF can perform view synthesis but not relighting, as it is agnostic to the light source. Thus, in our experiments we discard the lighting information for o-NeRF and only show view synthesis results. We have added this clarification to Section 4.1.
**Evaluating on datasets with more challenging visual effects**.
As suggested, we have added experiments on $5$ real opaque objects from the Diligent-MV dataset [Li et al., 2020]. This dataset provides multi-view photometric images of $5$ objects with different levels of shininess. For each object, there are $20$ views forming a circle around the object. For each view, there are $96$ calibrated light sources that are spatially fixed relative to the camera. We use a front view and a back view (view $10$ and view $20$) as the test set, and the other $8$ views as the training set. Note that the light directions in the training set are completely disjoint from those in the test set, since the light sources move with the camera. We show qualitative results in Figure 5 and quantitative metrics in Section 4.2, and attach the numbers below (we additionally compare to a recent method, IRON [Zhang et al., 2022], and drop NeRD because it has convergence issues on this dataset):
| Diligent-MV dataset | PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ |
|:----------|:--------:|:--------:|:--------:|
| o-NeRF | 31.36 | 0.944 | 0.038 |
| PhySG | 21.59 | 0.942 | 0.061 |
| IRON | 14.70 | 0.895 | 0.082 |
| OSF (ours) | **39.07** | **0.970** | **0.020** |
Our models synthesize reasonably faithful images of these real opaque objects, which feature complex effects including self-shadows (e.g., the bear and the buddha) and shiny reflections (e.g., the buddha's belly and the cow). Quantitatively, our models outperform the compared methods significantly. From Figure 5 we observe that IRON produces overly bright renderings, possibly because it explains bright regions (e.g., highlights caused by non-collocated lighting) with a very high albedo. As a result, its PSNR is very low.
Please also note that in our experiments on real translucent soaps, there are also highlights on the surface of the soaps (Figure 4).
**PSNR in Table 5**.
We have fixed Table 5.
The following are our responses to other weaknesses that were mentioned but not included in the requested changes:
**OSF only allows single distant light source**.
OSF supports multiple light sources because it learns a radiance transfer function: the outgoing radiance of an object lit by multiple distant lights equals the sum of the outgoing radiance of the object lit by each distant light individually. To validate this, we have included an ablation study. For each synthetic opaque object, we add a test set in which the object is lit by two distant lights. All other configurations are the same as in the original single-light test set, and we use the same OSF models trained with single light sources. We show a visual comparison in Figure S2 and numbers in Table S2 in our supplement, and we attach the numbers below:
| | PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ |
|:----------|:--------:|:--------:|:--------:|
| Lit by a single light source | **34.26** | 0.92 | 0.034 |
| Lit by two light sources | 32.40 | **0.93** | **0.027** |
As can be observed from the table, the free-viewpoint relighting results with two distant lights are comparable to those with one distant light. We note that the widely used environment maps are also simply collections of distant lights.
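The superposition argument can be sketched as follows. This is a toy illustration, not the OSF model: `toy_transfer` is a hypothetical stand-in for a learned transfer function $\rho$ (here a clamped cosine that ignores the viewing direction), and the light directions and intensities are made up.

```python
import numpy as np

def toy_transfer(normal, light_dir, view_dir):
    """Hypothetical stand-in for a transfer function rho, trained or defined for a single light."""
    return max(float(np.dot(normal, light_dir)), 0.0)

def outgoing_radiance(normal, view_dir, lights):
    """Reuse the single-light transfer function by summing over (direction, intensity) pairs."""
    return sum(intensity * toy_transfer(normal, d, view_dir) for d, intensity in lights)

normal = np.array([0.0, 0.0, 1.0])
view = np.array([0.0, 0.0, 1.0])
light_a = (np.array([0.0, 0.6, 0.8]), 1.0)  # unit direction, intensity
light_b = (np.array([0.6, 0.0, 0.8]), 0.5)

only_a = outgoing_radiance(normal, view, [light_a])
only_b = outgoing_radiance(normal, view, [light_b])
both = outgoing_radiance(normal, view, [light_a, light_b])
print(only_a, only_b, both)
```

Under this formulation, an environment map would be handled the same way, by treating each of its texels as one distant light in the `lights` list.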
**SSIM in Table 1**.
We have used consistent metric computation and fixed all tables in our revision.
***References***
[Li et al., 2020] Min Li, et al. "Multi-view photometric stereo: A robust solution and benchmark dataset for spatially varying isotropic materials." IEEE Transactions on Image Processing, 2020.
[Zhang et al., 2022] Kai Zhang, et al. "IRON: Inverse rendering by optimizing neural SDFs and materials from photometric images." In CVPR, 2022.