# NAFs NeurIPS Rebuttal
## Summary of our response and discussion
We genuinely thank all reviewers for their constructive comments, which have contributed to the improvement of our paper. We sincerely appreciate the positive 8-8-8-6 evaluation from reviewers u3Tz, e4mi, ossn, and jrJA.
Here is a summary of our response.
### Contributions
We would like to first emphasize the contributions of this paper:
* We propose Neural Acoustic Fields, which render the sound for arbitrary emitter and listener positions in a scene. Our NAFs are represented as an implicit function, and output the log-magnitude and phase information of a given impulse response.
* We demonstrate that by conditioning the NAFs on a learnable spatial grid of features, we can improve the generalizability of our architecture.
* We show that NAFs can learn the geometric structure of a scene, which can be useful for downstream tasks.
### Additional experiments
* **[Interpolation baseline]** To address the concern of reviewer jrJA, we add additional experimental results using a "kernel ridge regression" baseline.
* **[Additional RIR metric]** To address the concern of reviewer jrJA, we add quantitative results on DRR for our impulse responses.
* **[Spatial audio metric]** To address the concerns of reviewer u3Tz, we add an additional evaluation of the interaural cross-correlation coefficient for our network and baseline outputs. We show that our network better preserves spatial cues in the binaural impulse response than the baseline.
### Writing
We thank all reviewers for suggestions regarding our writing and clarity. We believe that the clarifications suggested by the reviewers will improve the communication of our work.
* We provide additional details about our network architecture [jrJA, u3Tz, e4mi], baseline setup [u3Tz, e4mi], and prior work [jrJA, u3Tz].
* We clarify that our NAFs are learned in the time-frequency (STFT) domain, and provide additional details about our phase representation [u3Tz].
* We have provided additional details about our dataset [u3Tz].
We are deeply grateful to the reviewers for their helpful suggestions, which have helped improve our paper significantly. The additional experiments and clarifications will be reflected in the final version as well.
Best,
Authors
## General response (pre-revision)
We are grateful to all reviewers for their constructive comments which we agree will significantly improve the communication of our work.
We are very encouraged by the reviewers' evaluation of the significance and novelty of this work. All four reviewers find our work on Neural Acoustic Fields (NAFs) to be novel (“This is a neat idea” (jrJA), “the general idea of NAFs is novel, interesting, and potentially impactful” (u3Tz), “the first, to my knowledge” (e4mi), “method appears to work well” (ossn)).
### 1. General clarifications
#### 1.1 Network details and reproducibility
Our method is fully reproducible. We have included a folder with our code, which contains hyperparameters, the network architecture, and the baselines, as part of the submitted supplementary material. We hope the code will help the community reproduce our work and inspire future studies.
#### 1.2 Differences from prior work
We would first like to clarify that our work is concurrent with [1].
NAFs differentiate themselves by learning a mapping for all possible emitter and listener locations in a scene. This is **fundamentally different** from prior work, which in practice is learned with a non-moving emitter or listener [1,2], or uses handcrafted parameterizations of the sound field [3]. We demonstrate that by conditioning the network on geometric features shared between the emitter and listener, we achieve a model that outperforms networks that use no geometric features or non-shared geometric features. We show that NAFs are significantly more compact than traditional audio coding baselines, and achieve higher quality when evaluated on T60, spectral, DRR, and IACC error. We further show that the audio representations learned by NAFs are informative of scene structure, making them a useful non-visual unsupervised scene representation.
### 2. Additional Experiments
The reviewers also suggest that additional metrics and baselines will make the paper stronger, highlight its strengths, clarify potential limitations, and outline directions for future work. We agree, and have augmented our revision with additional quantitative results. We have added an additional baseline, results for the direct-to-reverberant ratio (DRR) to better characterize the early components, and results for the interaural cross-correlation coefficient (IACC) to characterize spatial cues. We provide these metrics here, and will include them in the revision.
#### 2.1 Interpolation baseline
Here we compare against the method proposed in [4] on the MeshRIR dataset, where "Constrained-Orig" uses the 500Hz low-pass filter as used by the original authors, and "Constrained-Unfiltered" is our modification that uses the unfiltered impulse response.
| | Spectral | T60 | DRR |
|----------------------|-----------|-----------|-----------|
| Constrained-Orig | 2.539 | 8.192 | 2.497 |
| Constrained-Unfiltered | 1.370 | 6.294 | 3.702 |
| NAF (Dual) | **0.403** | 4.201 | 0.992 |
| NAF (Shared) | **0.403** | **4.191** | **0.972** |
#### 2.2 DRR metric
The direct-to-reverberant ratio (DRR) measures the ratio of the energy arriving via the direct path to the energy in the reverberant portion of the impulse response. We find that NAFs have lower DRR error than the baseline methods (mean absolute error in dB, lower is better); a sketch of how a DRR-style metric can be computed follows the table below.
| | Large 1 | Large 2 | Medium 1 | Medium 2 | Small 1 | Small 2 | MeshRIR | Mean |
|--------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| AAC-nearest | 1.748 | 2.424 | 1.344 | 1.343 | 1.213 | 1.108 | 1.286 | 1.495 |
| AAC-linear | 1.797 | 2.147 | 1.457 | 1.458 | 1.117 | 1.226 | 1.222 | 1.490 |
| Opus-nearest | 2.931 | 3.275 | 2.756 | 2.769 | 3.548 | 3.255 | 2.698 | 3.033 |
| Opus-linear | 2.645 | 2.771 | 2.381 | 2.370 | 3.266 | 2.882 | 2.529 | 2.692 |
| DSP | 3.559 | 4.421 | 4.727 | 4.805 | 5.622 | 6.723 | N/A | 4.976 |
| NAF (Dual) |1.645 | 1.830 |1.113 |**1.082** | **0.796** |**0.799** | 0.992 |1.179 |
| NAF (Shared) | **1.468** | **1.793** | **1.083** | 1.089 | 0.829 | 0.837 | **0.972** | **1.153** |
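For reference, below is a minimal sketch of how a DRR-style error can be computed from an impulse response. The 2.5 ms direct-path window and the helper names (`estimate_drr`, `drr_error`) are illustrative assumptions, not our exact evaluation code.

```python
import numpy as np

def estimate_drr(ir, sr, direct_window_ms=2.5):
    """Estimate the direct-to-reverberant ratio (dB) of a mono impulse response.

    The direct component is taken as a short window around the largest peak;
    everything after the window is treated as reverberant energy.
    """
    peak = int(np.argmax(np.abs(ir)))
    half_win = int(sr * direct_window_ms / 1000)
    start, end = max(0, peak - half_win), peak + half_win
    direct_energy = np.sum(ir[start:end] ** 2)
    reverb_energy = np.sum(ir[end:] ** 2) + 1e-12  # avoid division by zero
    return 10.0 * np.log10(direct_energy / reverb_energy)

def drr_error(pred_ir, gt_ir, sr):
    """Absolute DRR error (dB) between a predicted and a ground-truth response."""
    return abs(estimate_drr(pred_ir, sr) - estimate_drr(gt_ir, sr))
```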
#### 2.3 IACC metric
The interaural cross-correlation coefficient (IACC) measures spatial localization cues in binaural impulse responses, and is correlated with human localization performance. We find that NAFs achieve the lowest IACC error on average; a sketch of how the interaural cross-correlation can be computed follows the table below.
| | Large 1 | Large 2 | Medium 1 | Medium 2 | Small 1 | Small 2 | Mean |
|--------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| AAC-nearest | 236.8 | 184.2 | 213.7 | 215.3 | 264.8 | 272.5 | 231.2 |
| AAC-linear | 212.3 | 156.7 | 185.9 | 187.8 | 245.2 | 265.2 | 208.8 |
| Opus-nearest | 73.75 | 45.97 | 71.97 | 74.70 | 103.8 | **67.40** | 72.93 |
| Opus-linear | 75.56 | 48.32 | 73.38 | 77.33 | 109.2 | 78.10 | 76.98 |
| DSP | 460.5 | 446.0 | 430.0 | 430.1 | 443.6 | 446.3 | 442.7 |
| NAF (Dual) | 74.01 | 45.94 | 71.89 | 74.70 | 103.8 | **67.40** | 72.96 |
| NAF (Shared) | **73.68** | **45.90** | **71.52** | **73.58** | **103.6** | **67.40** | **72.62** |
* Mean absolute difference of IACC (unit in seconds, values here multiplied by 1e6). Lower is better.
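For reference, the sketch below shows one common way to compute the interaural cross-correlation of a binaural impulse response: the normalized cross-correlation between the left and right channels, with the peak searched over lags of at most ±1 ms. The function name and the exact lag window are illustrative assumptions; the convention used in our evaluation may differ in minor details.

```python
import numpy as np

def iacc(ir_left, ir_right, sr, max_lag_ms=1.0):
    """Interaural cross-correlation of a binaural impulse response.

    Returns the peak of the normalized cross-correlation between the two
    channels and the lag (in seconds) at which that peak occurs, with the
    lag restricted to +/- max_lag_ms.
    """
    max_lag = int(sr * max_lag_ms / 1000)
    norm = np.sqrt(np.sum(ir_left ** 2) * np.sum(ir_right ** 2)) + 1e-12
    corr = np.correlate(ir_left, ir_right, mode="full") / norm
    center = len(ir_right) - 1  # index of zero lag in the "full" correlation
    window = corr[center - max_lag : center + max_lag + 1]
    peak_idx = int(np.argmax(np.abs(window)))
    peak_value = window[peak_idx]
    peak_lag = (peak_idx - max_lag) / sr
    return peak_value, peak_lag
```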
### Conclusion
We thank the reviewers for their careful feedback and additional suggestions for evaluation, which will make the paper significantly stronger.
[1] Richard, Alexander, et al. "Deep Impulse Responses: Estimating and Parameterizing Filters with Deep Networks." (2022)
[2] Richard, Alexander, et al. "Neural synthesis of binaural speech from mono audio." (2020)
[3] Chaitanya, Chakravarty R. Alla, et al. "Directional sources and listeners in interactive sound propagation using reciprocal wave field coding." (2020)
[4] Ueno, Natsuki, et al. "Kernel ridge regression with constraint of Helmholtz equation for sound field interpolation." (2018)
## Reviewer jrJA
We are encouraged by your assessment that modeling scene acoustics is an important question and that our approach is a novel one. We thank Reviewer jrJA for the detailed and constructive review. Below are our responses to specific comments. We look forward to further discussion, and are happy to answer any questions.
> **Q1) Comparison to previous sound field models**
<!-- TODO: Add specific discussion about their paper-->
We agree it is important to highlight the difference between NAFs and past work.
Prior work has proposed both parametric and non-parametric methods to interpolate the sound field. Parametric methods typically capture only perceptually relevant cues, while non-parametric methods seek to estimate the sound field itself. Prior models typically represent the sound field as a linear combination of spherical or plane wave expansions. Methods similar to [1] typically leverage priors or assumptions about the sound field, such as known physical constants, far-field sound sources, or the positions of the receivers. While these assumptions may hold in certain settings, acoustic environments can be complex and deviate from model priors. Unlike these traditional approaches, our NAFs are learned from data. Furthermore, different from past approaches, which typically estimate a sound field for a fixed source, our NAFs enable arbitrary positioning of both the source and receiver.
Since there is no public implementation of [1], we provide additional quantitative results using the method described in "Kernel Ridge Regression With Constraint of Helmholtz Equation for Sound Field Interpolation" [2] on the MeshRIR dataset. We use two variants of the model: the first uses the original parameters, which include a 500Hz low-pass filter, and in the second we modify the model to use the unfiltered RIR. We use their originally proposed regularization value of 0.1.
| | Spectral | T60 | DRR |
|----------------------|-----------|-----------|-----------|
| Ridge-Orig | 2.539 | 8.192 | 2.497 |
| Ridge-Unfiltered | 1.370 | 6.294 | 3.702 |
| NAF (Dual) | **0.403** | 4.201 | 0.992 |
| NAF (Shared) | **0.403** | **4.191** | **0.972** |
We find our model consistently outperforms this baseline; a minimal sketch of the interpolation procedure we compared against is included below. We will include a discussion of [1, 2] and related methods in our updated revision.
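The sketch below illustrates, under our reading of [2], the per-frequency kernel ridge regression we compared against: the kernel between two microphone positions at wavenumber k is sinc(k·d) (the zeroth-order spherical Bessel function), and the pressure at a query position is interpolated with ridge-regularized kernel weights. The function names and the simplified single-frequency interface are illustrative, not a faithful reimplementation of [2].

```python
import numpy as np

def helmholtz_kernel(pos_a, pos_b, k):
    """sinc kernel j0(k * r) between two sets of 3D positions."""
    dists = np.linalg.norm(pos_a[:, None, :] - pos_b[None, :, :], axis=-1)
    return np.sinc(k * dists / np.pi)  # np.sinc(x) = sin(pi x) / (pi x)

def krr_interpolate(mic_pos, pressures, query_pos, freq, c=343.0, reg=0.1):
    """Interpolate the (complex) sound pressure at query_pos for one frequency.

    mic_pos:   (M, 3) measurement positions
    pressures: (M,) complex pressures at those positions for this frequency
    query_pos: (Q, 3) positions to interpolate
    """
    k = 2.0 * np.pi * freq / c  # wavenumber
    K = helmholtz_kernel(mic_pos, mic_pos, k)
    weights = np.linalg.solve(K + reg * np.eye(len(mic_pos)), pressures)
    K_query = helmholtz_kernel(query_pos, mic_pos, k)
    return K_query @ weights
```

In practice, this interpolation would be applied independently to each frequency bin of the measured RIRs before transforming back to the time domain.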
> **Q2) Differences in parameterization to "Deep Impulse Responses (DIRs)"**
We want to first clarify that our work is concurrent with [3].
Both [3] and our work parameterize the impulse response as a continuous implicit function. However, DIRs assume a stationary source or receiver, and in practice they focus on a static receiver with emitters distributed on a sphere. NAFs allow both the source and receiver to move freely within a room, which requires us to model a much larger and more complex set of impulse responses. This is a fundamentally more challenging problem.
An additional difference is our parameterization of the output. NAFs parameterize the output as log-magnitude and instantaneous frequency (phase) [4], while DIRs output a time domain waveform directly. We experimented with using the representation and MSE training loss as proposed in DIRs, and these results are presented in section **H** of the revised supplementary.
We observed that while outputting the waveform succeeds when modeling a small subset of the impulse responses, the network would only output an over-smoothed waveform when modeling an entire scene. We experimented with increasing the frequency of the Fourier features, as this has been suggested to improve the ability of the network to model high-frequency data [5]. However, we found that this introduced high-frequency noise into the predicted impulse response. This led us to adopt an STFT-based output representation. Prior work on using implicit networks for audio representations has similarly modeled either the log-magnitude of the STFT or the full magnitude-phase STFT [6, 7].
> **Q3) Results for the Direct-to-Reverberant Ratio**
We agree that Direct-to-Reverberant Ratio (DRR) is a useful metric for characterizing room impulse responses. Here we present the mean absolute error of the DRR for each method:
| | Large 1 | Large 2 | Medium 1 | Medium 2 | Small 1 | Small 2 | MeshRIR | Mean |
|--------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| AAC-nearest | 1.748 | 2.424 | 1.344 | 1.343 | 1.213 | 1.108 | 1.286 | 1.495 |
| AAC-linear | 1.797 | 2.147 | 1.457 | 1.458 | 1.117 | 1.226 | 1.222 | 1.490 |
| Opus-nearest | 2.931 | 3.275 | 2.756 | 2.769 | 3.548 | 3.255 | 2.698 | 3.033 |
| Opus-linear | 2.645 | 2.771 | 2.381 | 2.370 | 3.266 | 2.882 | 2.529 | 2.692 |
| DSP | 3.559 | 4.421 | 4.727 | 4.805 | 5.622 | 6.723 | N/A | 4.976 |
| NAF (Dual) |1.645 | 1.830 |1.113 |**1.082** | **0.796** |**0.799** | 0.992 |1.179 |
| NAF (Shared) | **1.468** | **1.793** | **1.083** | 1.089 | 0.829 | 0.837 | **0.972** | **1.153** |
* Mean absolute error of the DRR, units in dB. Lower is better.
Note the DSP baseline was not implemented for MeshRIR due to the lack of absolute coordinates.
We thank the reviewer for the suggestions, and have added additional quantitative comparisons with a sound field interpolation method alongside DRR results. Following your suggestion, we have also reduced the length of section 3.1 in the revision. We will include additional discussion and add these results to the revision.
[1] Antonello, Niccolo, et al. "Room impulse response interpolation using a sparse spatio-temporal representation of the sound field." (2017)
[2] Ueno, Natsuki, et al. "Kernel ridge regression with constraint of Helmholtz equation for sound field interpolation." (2018)
[3] Richard, Alexander, et al. "Deep Impulse Responses: Estimating and Parameterizing Filters with Deep Networks." (2022)
[4] Engel, Jesse, et al. "GANSynth: Adversarial neural audio synthesis." (2019)
[5] Tancik, Matthew, et al. "Fourier features let networks learn high frequency functions in low dimensional domains." (2020)
[6] Gao, Ruohan, et al. "Objectfolder: A dataset of objects with implicit visual, auditory, and tactile representations." (2021)
[7] Du, Yilun, et al. "Learning signal-agnostic manifolds of neural fields." (2021)
## Reviewer u3Tz
We appreciate your assessment that the NAFs are a novel and interesting idea. We thank Reviewer u3Tz for the helpful review. Below are our responses to specific comments.
> **Q1) Evaluation on binaural/spatial rendering**
We agree that binaural cues are important and should be reflected in our evaluations. The interaural cross-correlation coefficient (IACC) is a commonly accepted metric for the spatial localization of sound sources from binaural audio [1], and is believed to be predictive of human sound localization [2]. The IACC is computed for each binaural impulse response, and we report the mean absolute difference between our predicted and the ground-truth IACC.
| | Large 1 | Large 2 | Medium 1 | Medium 2 | Small 1 | Small 2 | Mean |
|--------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| AAC-nearest | 236.8 | 184.2 | 213.7 | 215.3 | 264.8 | 272.5 | 231.2 |
| AAC-linear | 212.3 | 156.7 | 185.9 | 187.8 | 245.2 | 265.2 | 208.8 |
| Opus-nearest | 73.75 | 45.97 | 71.97 | 74.70 | 103.8 | **67.40** | 72.93 |
| Opus-linear | 75.56 | 48.32 | 73.38 | 77.33 | 109.2 | 78.10 | 76.98 |
| DSP | 460.5 | 446.0 | 430.0 | 430.1 | 443.6 | 446.3 | 442.7 |
| NAF (Dual) | 74.01 | 45.94 | 71.89 | 74.70 | 103.8 | **67.40** | 72.96 |
| NAF (Shared) | **73.68** | **45.90** | **71.52** | **73.58** | **103.6** | **67.40** | **72.62** |
* Table 1. Mean absolute difference of IACC (unit in seconds, values here multiplied by 1e6). Lower is better.
Our method has the lowest IACC error, which indicates that our method is capable of rendering spatial audio. We include this important metric in our revised paper. Thank you for your valuable suggestions.
> **Q2) Technical clarifications**
* **[Cost of ray tracing]** SoundSpaces does not use 200 rays; instead, it uses [5000 rays \* 200 bounces] for each listener and [200 rays \* 10 bounces] for each emitter. We would like to clarify that in our paper, we refer to ray tracing in the context of a learned implicit neural representation of scene structure. Due to the computational cost, current state-of-the-art work on ray tracing in implicit neural representations is limited to a single bounce [3]. We will clarify this in the revision.
* **[Discussion of prior work]** We agree that [4] is an important work in modeling binaural audio. However, our approach differs from [4] in several ways. First, while [4] outputs the binaural audio directly, we output an impulse response that can be applied to arbitrary mono audio. Second, NAFs model the STFT (log-magnitude and instantaneous frequency of the phase), whereas [4] learns the time-domain waveform. Finally, in practice [4] is trained and evaluated on data where the listener is fixed and only the emitter can move, while the NAF model is trained and evaluated on listener and emitter pairs that can both move; this requires modeling a much larger set of impulse responses. We did attempt to adapt [4] to our task by using an impulse function as input and the impulse response as supervision, but could not successfully learn the impulse response in this modified setup, likely because their network was not tuned for this task. We will include additional discussion of [4] in our revision.
* **[Instantaneous frequency]** We use the STFT phase instantaneous frequency representation proposed in GANSynth [5], which retains the phase for each frequency band. After the STFT of a waveform is computed, the phase angle within each frequency band is extracted, unwrapped over the 2π boundary, and differenced (finite difference) along the time dimension. To recover the time-domain waveform, we take the cumulative sum of the instantaneous frequency along the time dimension within each frequency band, recombine it with the magnitude, and apply the inverse STFT. This recovers exactly the same waveform as the input. The `get_wave_2` function provided in `testing/test_utils.py` in our supplementary shows how we recover the waveform, and a minimal sketch of the encode/decode procedure is included after this list. We thank the reviewer for helping us clarify this point.
* **[Bitrates of baselines]** We believe you may have misinterpreted our results presented in Table 3. To clarify, our NAFs being smaller than the baselines is an ***advantage***, not a sign of NAFs being inferior to the baselines. The Opus and AAC baselines perform worse than NAFs despite being 20x and 40x the size. The code we used to implement the baselines is in `baselines/make_data_aac.py` and `baselines/make_data_opus.py`, and was provided with the initial submission. We have also detailed the versions of the encoders we used in section F of our supplementary: libopus 1.3.1 and the ffmpeg 5.0 native AAC encoder. libopus was set to maximum complexity for best quality, music mode for better wideband performance, and constrained variable bitrate mode; AAC was set to constant bitrate mode. In the revision, we will also mention in the main paper where to find the baseline details.
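As referenced in the instantaneous frequency point above, a minimal sketch of the encode/decode procedure is given below, operating on an already-computed complex STFT. The function names are illustrative; the actual decoding code is the `get_wave_2` function in `testing/test_utils.py`.

```python
import numpy as np

def stft_to_logmag_if(spec):
    """Convert a complex STFT (freq x time) to log-magnitude and instantaneous frequency."""
    log_mag = np.log(np.abs(spec) + 1e-8)  # small epsilon for numerical stability
    phase = np.unwrap(np.angle(spec), axis=1)  # unwrap over the 2*pi boundary in time
    # Finite difference over time; the first frame keeps its absolute phase.
    inst_freq = np.diff(phase, axis=1, prepend=0.0)
    return log_mag, inst_freq

def logmag_if_to_stft(log_mag, inst_freq):
    """Invert the representation: the cumulative sum over time recovers the phase."""
    phase = np.cumsum(inst_freq, axis=1)
    return np.exp(log_mag) * np.exp(1j * phase)
```

Applying the inverse STFT to the reconstructed complex spectrogram then recovers the time-domain waveform.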
> **Q3) Other clarifications**
* **[NAFs and the time-frequency domain]** Thank you for pointing this out! We correct this in the revision in L48 and L114.
* **[Time domain output]** Yes, L113 should just indicate the time domain waveform.
* **[Figure of the shared grid network]** Our intention was to show the shared grid network in Figure 2 of our main paper, since it was the best-performing architecture. We will highlight in Figure 2 that we are showing the "shared grid" design, and will further include this figure in the supplementary to provide a better comparison.
* **[Dataset details]** We discussed the restricted parameterization of SoundSpaces in section 4.2, and note that it is restricted to a 2D plane. In the revision, we will move the specifics about both SoundSpaces and MeshRIR into section 4.1.
We thank Reviewer u3Tz for providing detailed and thoughtful feedback. Following these suggestions, we have run an evaluation measuring how well our framework preserves binaural cues. We would like to highlight that code is provided in our supplementary for reproducibility. We do note that there may have been a misunderstanding regarding the size of our NAFs, and hope that our clarifications will aid the reviewer in their final evaluation, particularly in light of our additional results.
[1] Rafaely, Boaz, et al. "Interaural cross correlation in a sound field represented by spherical harmonics." (2010)
[2] Andreopoulou, Areti, et al. "Identification of perceptually relevant methods of inter-aural time difference estimation." (2017)
[3] Srinivasan, Pratul P., et al. "Nerv: Neural reflectance and visibility fields for relighting and view synthesis." (2021)
[4] Richard, Alexander, et al. "Neural synthesis of binaural speech from mono audio." (2020)
[5] Engel, Jesse, et al. "GANSynth: Adversarial neural audio synthesis." (2019)
## Reviewer e4mi
We thank Reviewer e4mi for the helpful and constructive review. We address specific questions below, and will include additional details in a revision.
> **Q1) Network details**
The feature grid contains 64 features at each location and is initialized from a Gaussian distribution. In the case where individual grids are used for the emitter and listener, the two grids are initialized independently. The network consists of 8 fully connected layers, with leaky ReLU (slope 0.1) as the activation function. The network has two output neurons, representing log-magnitude and instantaneous frequency (phase). Each fully connected layer uses 512 hidden features. The network is trained using the Adam optimizer with an initial learning rate of 5e-4, which decays to 5e-5 by the end of training. The code definition for the network is provided as part of our supplementary, and a minimal sketch is shown below. We will update our supplementary to better detail our hyperparameters and setup.
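For clarity, here is a minimal PyTorch sketch of the architecture described above. The grid feature interpolation and the sinusoidal encoding of the query coordinates are abstracted into the input dimension, and the class name `NAFMLP` is illustrative rather than the name used in our code.

```python
import torch
import torch.nn as nn

class NAFMLP(nn.Module):
    """Minimal sketch of the NAF decoder: 8 fully connected layers with 512
    hidden features, leaky ReLU (slope 0.1), and two outputs
    (log-magnitude and instantaneous frequency)."""

    def __init__(self, in_dim, hidden_dim=512, num_layers=8, out_dim=2):
        super().__init__()
        layers = []
        dim = in_dim
        for _ in range(num_layers - 1):
            layers += [nn.Linear(dim, hidden_dim), nn.LeakyReLU(0.1)]
            dim = hidden_dim
        layers.append(nn.Linear(dim, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: concatenation of the grid features for the emitter and listener
        # with the (encoded) position, time, and frequency query.
        return self.net(x)

# Training uses Adam with an initial learning rate of 5e-4, decayed to 5e-5:
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
```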
> **Q2) Baseline and visualization details**
The impulse responses are indeed processed directly using the baseline encoders. This choice was motivated by our desire to have a set of impulse responses that can be applied to arbitrary sounds. It would be possible to encode the post-convolution audio; however, that would sacrifice the ability to generalize. The specific code we used to encode our data is provided in the `baselines` folder of the supplementary code. For the loudness visualization, we compute the root mean square (RMS) of the impulse response, as sketched below. Code for the visualization can be found in `testing/vis_loudness_NAF.py`.
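A minimal sketch of the RMS computation, assuming a mono impulse response array; the actual visualization code is in `testing/vis_loudness_NAF.py`, and the function name here is illustrative.

```python
import numpy as np

def impulse_loudness(ir):
    """Root mean square of an impulse response, used as a loudness proxy."""
    ir = np.asarray(ir, dtype=np.float64)
    return np.sqrt(np.mean(ir ** 2))
```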
> **Q3) Directional sounds**
For our qualitative demos, many of the instances have the emitter placed quite far away from the listener. In cases where the emitter is not in the same room as the listener, the reverberation of the sound is more obvious, while the directional nature of the sound is less so. In cases where the listener is immediately outside of the doorway, the directional aspect should be most evident (eg. Emitter location 1 in Large 2 at around 0:24; Emitter location 2 in Large 2 at around 0:10; Emitter location 1 in Large 1 at around 0:32). The use of headphones may better highlight the directional effect. We agree that losses explicitly designed for maintaining directional cues are worth exploring.
We thank you for your comments, and we hope that this clarifies our results! We will update the paper to reflect your suggestions.
## Reviewer ossn
We are grateful to Reviewer ossn for the suggestions and comments. We address specific comments below.
> **Q1) Quantitative and qualitative metrics**
As part of the revision, we have additionally provided direct-to-reverberant ratio (DRR) error and interaural cross-correlation coefficient (IACC) error. The former should reflect how well we model the direct sound, while the latter should reflect binaural spatialization. In addition, we performed a human evaluation where subjects were provided with headphones and asked to perform a two-alternative forced-choice task; 82.38% of the responses found our NAFs to outperform the AAC-nearest baseline. We also provide qualitative samples on our project site: https://sites.google.com/view/nafs-neurips2022
> **Q2) Visualization of the spectrograms**
All our spectrograms are presented with frequency on the vertical axis and time on the horizontal axis. In Figure 3, (e)-(g) show the spectrogram of a long music sample that has been convolved. We have added axis labels and adjusted the orientation of our figure to improve clarity in the revision.
> **Q3) Societal impact**
Due to space constraints our societal impacts section was put on the last page of our supplemental. We will add a note in our revision to indicate where this section can be found.
To further clarify, the primary use case for our work lies in virtual reality and gaming. As our work can lead to more believable and higher-quality representations of spatial audio than alternative methods, it is possible that our work could increase dependence on gaming and the time spent gaming.
Thank you for your comments! We will address your feedback in the revision.