Dear Reviewer HhWA,
Thank you again for your feedback. As the deadline for discussion is approaching, we would be happy to provide any additional clarifications that you may need.
In our previous response, we carefully studied your comments and made updates to the revision, as summarized below:
* Provided a discussion of how we enable generalization in the absence of multiview consistency via local geometric features.
* Added a discussion of the advantage of using an implicit function compared to using the dataset directly.
* Clarified in the paper that the grid features are learned, and provided additional details on how we initialize the grid.
* Modified our notation to distinguish the original impulse response from its STFT representation at a specific magnitude/phase index.
* Added a visualization using nearest-neighbor and linear interpolation baselines.
* Provided a description of the variables used in each equation, along with the dimension of each variable.
* Clarified the objective used in our NeRF-only training, and the objective used when training NeRF and NAF jointly.
Please let us know if you have any questions remaining. We would be happy to do anything that would be helpful in the time remaining!
Thank you for your time!
Best,
Authors
===========================
Dear Reviewer U1r6,
Thank you again for your suggestions. As the deadline for the discussion is coming up, we would be happy to address any remaining questions.
In our previous response, we carefully studied your suggestions and made updates to our revision, which we summarize below:
* We clarified the scope and goal of our work in the introduction.
* We provided additional technical details on the dimension and initialization of the grid.
* We added an experiment to the Appendix on blending the left/right latent.
* We detailed the dimension of each variable, and modified the notation to make the meaning of each variable clearer.
* We clarified in our paper which version of our NAF uses a grid.
* We provided additional details on how the waveform can be recovered without learning the phase.
* We added more details on how we perform temporal padding of the training data.
* We added details on how we select the test set.
* We clarified our use of Eqn. 8 and modified the language to clarify how we perform localization.
We would like to know if you have any additional comments or suggestions. We would be very happy to address any remaining questions in the time that remains!
Thank you for your suggestions!
Best,
Authors
=======================================
# V59y
Thank you for your comments. We have worked hard to address your concerns and have incorporated your suggestions in this revision. We are happy to address any remaining concerns with additional clarifications and further revisions to the paper.
> **Comparison with simple baselines.**
We have provided a comparison against two simple but strong baselines (nearest neighbor and linear interpolation), with quantitative results in Table 1 and qualitative results in Figure A5 of the Appendix. Our method achieves lower error when measured on the magnitude STFT, and lower average error when measured using T60 as a metric. We are currently working on additional qualitative results for our website; these should be ready in one or two days.
> **Advantages of our NAFs over interpolation baselines.**
A further important advantage is the compact nature of the implicit representation. On average, our NAFs use 0.5% of the storage of the interpolation baselines, as we show in Table A1 of the Appendix. This compactness means that we can encode and use spatial audio even when storing the full spatial audio data is infeasible (<30 MB for NAFs versus several GB for the interpolation baselines).
> **Limitations and future work.**
Regarding modeling only the magnitude of the STFT of a spatial impulse response, there is precedent in Image2Reverb (ICCV 2021), which models the magnitude-only STFT (and samples a random phase) while still producing high-quality spatial impulse responses.
We agree that the magnitude alone cannot capture all of the information in the impulse response. We will clarify our goal and the limitations of our work in an upcoming revision.
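To make the reconstruction concrete, here is a minimal sketch (our own illustration under assumed array shapes and STFT parameters, not code from our paper or from Image2Reverb) of recovering a time-domain impulse response from a magnitude-only STFT by sampling a random phase:

```python
import numpy as np
import librosa

def magnitude_to_ir(mag_stft, hop_length=128):
    """Reconstruct a time-domain IR from a (freq_bins, frames) magnitude STFT.

    The phase is sampled uniformly at random, as in the magnitude-only
    reconstruction discussed above; hop_length is an assumed STFT parameter.
    """
    phase = np.random.uniform(-np.pi, np.pi, size=mag_stft.shape)
    complex_stft = mag_stft * np.exp(1j * phase)   # attach random phase
    return librosa.istft(complex_stft, hop_length=hop_length)
```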
> **Consistency on paper writing.**
As the first work to model continuous spatial audio in a scene, we strive to make this framework accessible to a general audience and to inspire follow-up work in this emerging field. We understand your concerns, and we are happy to revise the text so that our claims and intended application are stated more precisely.
Because we are no longer able to update the original paper, we **provide an updated revision on [our website](https://sites.google.com/view/nafs-iclr-2022/home).** We have modified the language in the introduction to reflect that our goal is to model plausible spatial audio. We have also modified the conclusion to discuss the limitations of our current model and potential avenues for future work.
Please feel free to let us know if you have additional comments!
<!-- Thank you for your comments.
Early in the project, we reached out to the authors behind the recent SIGGRAPH papers in Microsoft's Project Triton. However the code and data could not be released under an academic license. In their work, their primary concern is to achieve a compact encoding. Because we utilize a learning representation, our approach has a fixed storage cost regardless of scene size and complexity.
As an additional baseline, we propose to add a comparison where the impulse responses undergo lossy compression with a modern audio codec (libopus) or a modern image codec (JPEG/AV1/HEVC) applied to the STFT. This would allow us to compare our approach to when the room acoustics are compactly encoded for VR/gaming applications.
As far as we are aware, T60 error is the primary qualitative metric used to evaluate the output in recent papers that deal with learned spatial audio (Image2Reverb, Deep Acoustics), and we provide T60 error as a metric following your feedback.
We will further refine our claims in a revision to our paper. -->
# U1r6
Thank you for the response.
> **How is linear interpolation computed?**
As we clarify in the caption of Table 1, we adopt a stronger baseline that interpolates in the time/waveform domain. Prior work treats interpolation in both the time and frequency domains as valid approaches [1]. We initially used linear interpolation in the log-magnitude STFT domain; however, interpolating in the time domain yields lower MSE and lower T60 error than interpolating in the log-magnitude STFT domain (with Griffin-Lim used for phase recovery when computing T60). For this reason, we believe that time-domain interpolation is the stronger baseline.
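For clarity, below is a minimal sketch of the kind of time-domain linear interpolation baseline we refer to (our own illustration; the two-nearest-neighbor, inverse-distance weighting shown here is an assumption, not the exact implementation used for Table 1):

```python
import numpy as np

def interpolate_ir_time_domain(query_pos, train_pos, train_irs):
    """Blend the two nearest training IRs in the time/waveform domain.

    query_pos: (d,) coordinates of the query position.
    train_pos: (N, d) coordinates of the training impulse responses.
    train_irs: (N, T) time-domain impulse responses (equal length assumed).
    """
    dists = np.linalg.norm(train_pos - query_pos, axis=1)
    i, j = np.argsort(dists)[:2]                    # two closest samples
    w = dists[j] / (dists[i] + dists[j] + 1e-12)    # closer IR gets the larger weight
    return w * train_irs[i] + (1.0 - w) * train_irs[j]
```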
> **Phase and spatial audio.**
For gaming/VR applications, past work on spatial audio representations generally does not model phase either. In particular, Pulkki's DirAC [2], Raghuvanshi and Snyder's parametric coding [3], and Singh et al.'s Image2Reverb [4] do not encode phase; they instead construct random-phase filters, construct minimum-phase filters, or sample a random phase for log-magnitude STFT reconstruction, respectively.
> **Speech dereverberation.**
Like other work, we model the spatial sound by convolving a clean source with the impulse response generated by our system. We do not explore dereverberation via deconvolution in this work, and we agree that direct deconvolution using our magnitude-only representation with inferred phase would not yield high-quality dereverberation. Instead of blind dereverberation, a possible approach to producing perceptually reasonable dereverberated speech/audio would be to learn a network that predicts the dry speech/audio conditioned on the reverberant audio STFT and the NAF-predicted magnitude STFT of the impulse response [5]. The time-domain signal could then be reconstructed from this clean magnitude spectrogram estimate using Griffin-Lim or a neural vocoder (e.g., WaveNet). Our work takes a first step toward modeling the magnitude of impulse responses, and we believe it will inspire follow-up work in this exciting direction.
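As a concrete illustration (a sketch under assumed shapes and parameters, not the pipeline used in our experiments), the two operations mentioned above look roughly as follows: rendering reverberant audio by convolving a dry source with a predicted impulse response, and recovering a time-domain signal from an estimated dry magnitude spectrogram with Griffin-Lim:

```python
import librosa
from scipy.signal import fftconvolve

def render_reverberant(dry_audio, impulse_response):
    """Spatialize a dry source by convolving it with a (predicted) IR."""
    return fftconvolve(dry_audio, impulse_response, mode="full")

def dry_magnitude_to_audio(dry_mag_stft, hop_length=128, n_iter=60):
    """Recover a waveform from a magnitude-only estimate via Griffin-Lim."""
    return librosa.griffinlim(dry_mag_stft, n_iter=n_iter, hop_length=hop_length)
```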
> **Additional qualitative results.**
Thank you for this great suggestion. We are currently working on additional qualitative results; these should be ready in one or two days. We will provide an update once the results are posted.
Please let us know if you have any additional questions!
[1] Raghuvanshi, Nikunj, et al. "Precomputed wave simulation for real-time sound propagation of dynamic sources in complex scenes." SIGGRAPH (2010).
[2] Pulkki, Ville. "Directional audio coding in spatial sound reproduction and stereo upmixing." Audio Engineering Society Conference (2006).
[3] Raghuvanshi, Nikunj, and John Snyder. "Parametric wave field coding for precomputed sound propagation." SIGGRAPH (2014).
[4] Singh, Nikhil, et al. "Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis." ICCV (2021).
[5] Han, Kun, et al. "Learning spectral mapping for speech dereverberation and denoising." TASLP (2015).
<!-- A possible approach to produce a perceptually reasonable de-reverberant speech/audio could be performing division of the reverberant STFT magnitude with the predicted RIR frequency magnitude to obtain an estimate of real component of the dry speech/audio [5] in the frequency domain. In this example, we can create the time-domain signal from this magnitude spectrogram estimate using Griffin-Lim or a neural vocoder (wavenet). -->
=========================
# HhWA
Dear Reviewer HhWA,
We deeply appreciate your feedback. Our goal has not changed since our initial submission: to learn a continuous representation of spatial acoustics from sparse training samples.
The link we provided in the original paper included many qualitative results for our work, which we have further augmented with a direct comparison between our network and interpolation baselines:
[https://sites.google.com/view/nafs-iclr-2022](https://sites.google.com/view/nafs-iclr-2022/home)
Generalization to novel locations remains an open question for both visual and acoustic models that use implicit functions. As we show on the website, interpolation using the dataset alone cannot yield satisfactory results. We show in our paper that learning local spatial features can help in the absence of the multiview consistency available in vision.
As the first work to model continuous spatial audio in a scene, we hope to make this framework accessible to a general audience and to inspire follow-up work in this emerging field. We hope you will consider a more positive evaluation of our work.
Honestly, we are uncomfortable with the concern about “overclaiming”. It seems that the reviewers overlooked the website posted with our first draft and thus misunderstood the goal of our paper; this is not our fault. We have offered all the requested clarifications, addressed the concerns, and updated the draft accordingly. We would much appreciate it if you would reconsider your score.
Thanks for your time!
Best,
Authors
# HhWA v2
Dear Reviewer HhWA,
We deeply appreciate your feedback, and are grateful that you find our work interesting. Our goal has never changed since our initial submission: to learn a continuous representation of spatial acoustics from sparse training samples.
The link we provided in the original paper included many qualitative results for our work, which we have further augmented with a direct comparison between our network and interpolation baselines: [https://sites.google.com/view/nafs-iclr-2022](https://sites.google.com/view/nafs-iclr-2022/home)
As we show on the website, interpolation using the dataset alone cannot yield satisfactory results.
We respectfully push back on the concern about "overclaiming". We **never** claim that we can generalize to unseen scenes. In the context of neural implicit representations, the goal is usually to learn a representation that generalizes to unseen locations/views within a *single scene*, after training on sparse observations of that same scene. In the absence of the multiview consistency used in vision, we demonstrate that learning local spatial features is a viable alternative for spatial acoustics.
In our very first draft, we included a website with qualitative results that demonstrate unequivocally that our networks can continuously infer spatial audio at unseen locations. We have further offered the requested clarifications, addressed the concerns, and updated the revision accordingly. As the first work to model continuous spatial audio in a scene, we hope to make this framework sufficiently general for a broad audience and to inspire follow-up work in this emerging field.
We hope you can consider a more positive evaluation of our work.
Thank you for your time!
Best,
Authors