Dear Reviewer HhWA,

Thank you again for your feedback. As the deadline for discussion is approaching, we would be happy to provide any additional clarifications you may need. In our previous response, we carefully studied your comments and made updates to the revision, as summarized below:

* Provided a discussion of how we enable generalization in the absence of multiview consistency via local geometric features.
* Added a discussion of the advantage of using an implicit function compared to using the dataset directly.
* Clarified in the paper that the grid features are learned, and provided additional details on how we initialize the grid.
* Modified our notation to distinguish the original impulse response from the STFT representation of the impulse response at a specific magnitude/phase index.
* Added a visualization using the nearest-neighbor and linear interpolation baselines.
* Provided a description of the variables used in each equation, and the dimension of each variable.
* Clarified the objective used in our NeRF-only training, and the objective used when training NeRF and NAF jointly.

Please let us know if you have any remaining questions. We would be happy to do anything that would be helpful in the time remaining!

Thank you for your time!

Best,
Authors

===========================

Dear Reviewer U1r6,

Thank you again for your suggestions. As the deadline for the discussion is coming up, we would be happy to address any remaining questions. In our previous response, we carefully studied your suggestions and made updates to our revision, which we summarize below:

* We clarify the scope and goal of our work in the introduction.
* We provide additional technical details on the dimension and initialization of the grid.
* We added an experiment to the Appendix on blending the left/right latent.
* We now detail the dimension of each variable, and modify the notation to make the meaning of each variable more obvious.
* We now clarify which version of our NAF has a grid in the paper.
* We provide additional details on how the waveform can be recovered without learning the phase.
* We added more details on how we perform temporal padding of the training data.
* We added additional details on how we select the test set.
* We have clarified our use of Eqn. 8 and modified the language to clarify how we perform localization.

We would like to know if you have any additional comments or suggestions. We would be very happy to address any remaining questions in the time remaining!

Thank you for your suggestions!

Best,
Authors

=======================================

# V59y

<!-- Thank you for your comments. We are glad to see that we have addressed your main concerns on modeling phase, and appreciate your comments on our work. We are happy to address any remaining concerns with additional clarifications followed by a paper revision! -->

Thank you for your comments. We have worked hard to address your concerns, and we have incorporated your suggestions in this revision. We are happy to address any remaining concerns with additional clarifications, followed by a paper revision.

> **Comparison with simple baselines.** We have provided a comparison against two simple but strong baselines (nearest neighbor and linear interpolation); quantitative results are shown in Table 1, with qualitative results in Figure A5 of the Appendix. We demonstrate that we achieve lower error when measured in magnitude-STFT, and lower average error when measured using T60 as a metric. We are currently working on additional qualitative results for our website; this will take one or two days.
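As a point of reference for the T60 metric above, the sketch below shows one standard way to estimate T60 from a time-domain impulse response via Schroeder backward integration; the sample rate `fs`, the fit range, and the function name are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def estimate_t60(ir, fs, fit_range_db=(-5.0, -25.0)):
    """Estimate T60 from a time-domain impulse response via Schroeder backward
    integration: fit the decay slope over `fit_range_db` (dB) and extrapolate
    the time needed to decay by 60 dB."""
    energy = np.asarray(ir, dtype=np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]            # Schroeder energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)  # normalized, in dB

    t = np.arange(len(ir)) / fs
    hi_db, lo_db = fit_range_db
    mask = (edc_db <= hi_db) & (edc_db >= lo_db)
    slope, _ = np.polyfit(t[mask], edc_db[mask], deg=1)  # dB per second (negative)
    return -60.0 / slope
```

A T60 error between a predicted and a measured response can then be taken as the absolute difference of the two estimates.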
> **Advantages of our NAFs over interpolation baselines.** A further important advantage is the compact nature of the implicit representation. On average, our NAFs use 0.5% of the storage of these interpolation baselines, as we show in Table A1 of the Appendix. The compact nature of the implicit representation means that we can encode and utilize spatial audio even when we cannot store the full amount of spatial audio data (<30 MB for NAFs versus several GBs for interpolation).

> **Limitations and future work.** For modeling only the magnitude of the STFT of a spatial impulse response, precedent can be found in the Image2Reverb (ICCV 2021) paper, which describes modeling the magnitude-only STFT (and sampling a random phase) as a way to model the spatial impulse response with high quality. We agree that the magnitude alone cannot account for all the information in the impulse response. We will clarify our goal and the limitations of our work in an upcoming revision.
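To make the magnitude-only reconstruction mentioned above concrete, here is a minimal sketch of attaching uniformly random phase to a magnitude STFT and inverting it; the STFT parameters and the use of `librosa` are assumptions for illustration, not details from either paper.

```python
import numpy as np
import librosa

def magnitude_to_ir(mag_stft, hop_length=128, win_length=512, seed=0):
    """Invert a magnitude-only STFT to a time-domain impulse response by
    sampling uniformly random phase (Image2Reverb-style reconstruction)."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, size=mag_stft.shape)
    complex_stft = mag_stft * np.exp(1j * phase)
    return librosa.istft(complex_stft, hop_length=hop_length, win_length=win_length)
```

`librosa.griffinlim` is a drop-in alternative when an iteratively refined phase estimate is preferred over random phase.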
> **Consistency in paper writing.** As the first work modeling continuous spatial audio in a scene, we strive to make this framework accessible to a general audience and to inspire follow-up work in this new, emerging field. We understand your concerns, and we are happy to change the text to make our application and claims more appropriate.

Because we are no longer able to update the original paper, we **provide an updated revision on [our website](https://sites.google.com/view/nafs-iclr-2022/home).** We have modified our language in the introduction to reflect that our goal is to model plausible spatial audio, and we have modified our conclusion to discuss the limitations of our current model and potential avenues for future work.

Please feel free to let us know if you have additional comments!

<!-- Thank you for your comments. Early in the project, we reached out to the authors behind the recent SIGGRAPH papers in Microsoft's Project Triton. However, the code and data could not be released under an academic license. In their work, the primary concern is to achieve a compact encoding. Because we utilize a learned representation, our approach has a fixed storage cost regardless of scene size and complexity. As an additional baseline, we propose adding a comparison where the impulse responses undergo lossy compression with a modern audio codec (libopus) or a modern image codec (JPEG/AV1/HEVC) applied to the STFT. This would allow us to compare our approach against settings where the room acoustics are compactly encoded for VR/gaming applications. As far as we are aware, T60 error is the primary quantitative metric used to evaluate the output in recent papers on learned spatial audio (Image2Reverb, Deep Acoustics), and we provide T60 error as a metric following your feedback. We will further refine our claims in a revision to our paper. -->

# U1r6

Thank you for the response.

> **How is linear interpolation computed?** As we clarify in the caption of Table 1, we adopt a stronger baseline by interpolating in the time/waveform domain. Prior work mentions interpolation in both the time and the frequency domain as valid approaches [1]. We initially used linear interpolation in the log-magnitude STFT domain; however, after performing the interpolation in the time domain, we observe lower MSE error and lower T60 error than with interpolation in the log-magnitude STFT domain followed by Griffin-Lim phase recovery. For this reason, we believe that time-domain interpolation is the stronger baseline.
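To make the time-domain baseline concrete, the sketch below interpolates directly on waveforms; the inverse-distance weighting over the k nearest measured responses is an illustrative choice on our part and may differ from the exact interpolation scheme used in the paper.

```python
import numpy as np

def time_domain_interp(query_pos, train_positions, train_irs, k=4):
    """Baseline: interpolate impulse responses directly on their time-domain
    waveforms, using inverse-distance weights over the k nearest training points.

    train_positions: (N, D) listener coordinates
    train_irs:       (N, T) measured impulse responses, assumed time-aligned
    """
    dists = np.linalg.norm(train_positions - query_pos, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-8)
    weights /= weights.sum()
    # No phase recovery (e.g. Griffin-Lim) is needed, since the interpolation
    # operates on raw waveforms rather than log-magnitude STFTs.
    return weights @ train_irs[nearest]
```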
> **Phase and spatial audio.** For gaming/VR tasks, past spatial audio representations generally also do not model phase. In particular, Pulkki's DirAC [2], Raghuvanshi's parametric coding [3], and Image2Reverb [4] do not encode phase; they construct random-phase filters, minimum-phase filters, or sample a random phase for log-magnitude STFT reconstruction, respectively.

> **Speech dereverberation.** Like other work, we model the spatial sound by convolving a clean source with the impulse response generated by our system. We do not explore dereverberation via deconvolution in this work, and we agree that direct deconvolution using our magnitude-only representation with inferred phase would not yield high-quality dereverberation. A possible approach to producing perceptually reasonable dereverberated speech/audio would be to learn a network that produces dry speech/audio conditioned on the reverberant audio STFT and the NAF-predicted magnitude-STFT impulse response [5], instead of performing blind dereverberation. In that case, we can reconstruct the time-domain signal from the clean magnitude-spectrogram estimate using Griffin-Lim or a neural vocoder (WaveNet). Our work takes a first step toward modeling the magnitude of impulse responses, and we believe it will inspire follow-up work in this exciting direction.
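As a small illustration of the rendering step mentioned above (convolving a clean source with a generated impulse response), the following sketch assumes single-channel signals at a shared sample rate; it is not code from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_reverberant(dry, impulse_response):
    """Render audio at a listener position by convolving a dry (anechoic)
    source with the predicted room impulse response."""
    wet = fftconvolve(dry, impulse_response, mode="full")
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet  # normalize to avoid clipping
```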
> **Additional qualitative results.** Thank you for this great suggestion. We are currently working on additional qualitative results; this will take one or two days. We will provide an update once the results are posted.

Please let us know if you have any additional questions!

[1] Raghuvanshi, Nikunj, et al. "Precomputed wave simulation for real-time sound propagation of dynamic sources in complex scenes." SIGGRAPH (2010).
[2] Pulkki, Ville. "Directional audio coding in spatial sound reproduction and stereo upmixing." Audio Engineering Society Conference (2006).
[3] Raghuvanshi, Nikunj, and John Snyder. "Parametric wave field coding for precomputed sound propagation." SIGGRAPH (2014).
[4] Singh, Nikhil, et al. "Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis." ICCV (2021).
[5] Han, Kun, et al. "Learning spectral mapping for speech dereverberation and denoising." TASLP (2015).

<!-- A possible approach to producing perceptually reasonable dereverberated speech/audio could be dividing the reverberant STFT magnitude by the predicted RIR frequency magnitude to obtain an estimate of the dry speech/audio magnitude [5] in the frequency domain. The time-domain signal could then be created from this magnitude-spectrogram estimate using Griffin-Lim or a neural vocoder (WaveNet). -->

=========================

# HhWA

Dear Reviewer HhWA,

We deeply appreciate your feedback. Our goal has not changed since our initial submission: to learn a continuous representation of spatial acoustics from sparse training samples. The link we provided in the original paper included many qualitative results, which we have further augmented with a direct comparison between our network and the interpolation baselines: [https://sites.google.com/view/nafs-iclr-2022](https://sites.google.com/view/nafs-iclr-2022/home)

Generalization to novel locations remains an open question for both visual and acoustic models that use implicit functions. As we show on the website, interpolation using the dataset alone cannot yield satisfactory results. We show in our paper that learning local spatial features can help in the absence of the multiview consistency available in vision. As the first work modeling continuous spatial audio in a scene, we hope to make this framework accessible to a general audience and to inspire follow-up work in this new, emerging field. We hope you will consider a more positive evaluation of our work. Honestly, we are uncomfortable with the concern about "overclaiming". It seems that all reviewers overlooked the website posted with our first draft and thus misunderstood the goal of our paper; this is clearly not our fault. We have offered all the clarifications, addressed all of the concerns, and updated the draft accordingly. We would much appreciate it if you would reconsider and move your score to the positive side.

Thanks for your time!

Best,
Authors

# HhWA v2

Dear Reviewer HhWA,

We deeply appreciate your feedback, and are grateful that you consider our work to be interesting. Our goal has never changed since our initial submission: to learn a continuous representation of spatial acoustics from sparse training samples. The link we provided in the original paper included many qualitative results, which we have further augmented with a direct comparison between our network and the interpolation baselines: [https://sites.google.com/view/nafs-iclr-2022](https://sites.google.com/view/nafs-iclr-2022/home)

As we show on the website, interpolation using the dataset alone cannot yield satisfactory results. We respectfully push back on the concern of "overclaiming": we **never** claim that we can generalize to unseen scenes. In the context of neural implicit representations, the goal is usually to learn a representation that generalizes to unseen locations/views within a *single scene*, after training on sparse observations of that same scene. In the absence of the multiview consistency used in vision, we demonstrate that learning local spatial features is an applicable alternative for spatial acoustics. In our very first draft, we included a website with qualitative results that demonstrate unequivocally that our networks can continuously infer the spatial audio at unseen locations.

We have further offered the requested clarifications, addressed the concerns, and updated the revision accordingly. As the first work modeling continuous spatial audio in a scene, we hope to make this framework sufficiently general for a broad audience and to inspire follow-up work in this new, emerging field. We hope you will consider a more positive evaluation of our work.

Thank you for your time!

Best,
Authors
