## General response (pre-revision)

We thank all reviewers for their constructive comments. We are very encouraged by the reviewers' assessment of the novelty and significance of this work. All four reviewers find that our work on Neural Acoustic Fields (NAFs) is novel and interesting ("proposed idea is novel" (doFB), "is new and interesting" (HhWA), "very interesting" (U1r6), "representing RIRs is a big deal" (V59y)).

All reviewers pointed out that we should better present the work. We fully agree, and we have already identified several items that we can improve immediately. For example, although we provided a demo illustrating the qualitative results, we should have highlighted the link so that reviewers could get an intuitive sense of our work. We will provide additional qualitative and quantitative results in an upcoming revision. By adding new results, providing more detailed descriptions, and clarifying the key issues, the work will be substantially improved, and **we are very confident that all reviewers' concerns will be fully addressed**. First, we would like to make several clarifications on the common concerns.

## 1. General Clarification

### 1.1 Qualitative results

Several reviewers asked for demos of qualitative results. We actually provided a link to these demos in the original submission of our paper, which we list again here: https://sites.google.com/view/nafs-iclr-2022. These demos are important because they show that, after training on sparse data, we can continuously infer the sound at arbitrary locations and orientations in a scene. They therefore address the reviewers' question of whether our approach can learn realistic acoustic effects, including reverberation, decay, and portaling effects, as the listener traverses the scene. In our revision we will highlight the link.

### 1.2 Scope and goal of our work

Reviewers recommended that we be explicit about the scope and goal of this study. Here, we study how to represent sound propagation in an individual scene as a continuous implicit function.

* We are the first to propose learning the complete continuous sound field of a scene. Our approach uses only the emitter/listener position, orientation, and left/right ear as input, and is supervised with sparse impulse responses.
* We present an architecture that improves generalization when learning sound propagation in the absence of a vision-style photometric loss.
* Given this learned representation, we demonstrate that we can render the propagated sound at continuous locations in a scene for arbitrarily positioned emitters (see **Section 1.1**).
* We further demonstrate that our NAFs learn a representation that is also useful for visual rendering.

### 1.3 Memorization and interpolation

Our initial submission led two reviewers to wonder whether our model achieves its performance through mere memorization. This is not the case at all. In vision-based models (NeRF, SRN, etc.), generalization to unobserved viewpoints is achieved by enforcing a multiview-consistency constraint via a photometric loss [1, 2]. The same assumption cannot be made for sound propagation. Instead, because anisotropic reflections are strongly affected by the geometry close to the listener and emitter [3], we propose to learn the necessary local geometric properties of a scene. In our paper we include strong nearest-neighbor and linear interpolation baselines (sketched below).
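For concreteness, here is a schematic sketch of these two baselines, assuming a fixed emitter and training impulse-response log-magnitude STFTs stored per 2D listener position (array names and sizes are hypothetical, not those of our evaluation code):

```python
# Nearest-neighbor and linear interpolation baselines over measured responses,
# sketched for a fixed emitter with placeholder data.
import numpy as np
from scipy.spatial import cKDTree
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(0)
train_pos = rng.uniform(-1.0, 1.0, size=(500, 2))      # sparse training listener positions
train_stft = rng.standard_normal((500, 256, 64))        # placeholder log-magnitude STFTs (freq x time)
query_pos = rng.uniform(-1.0, 1.0, size=(10, 2))        # unobserved listener positions

flat = train_stft.reshape(len(train_pos), -1)

# Nearest-neighbor baseline: copy the closest training response.
_, nn_idx = cKDTree(train_pos).query(query_pos)
nn_pred = flat[nn_idx]

# Linear interpolation baseline: barycentric interpolation over the training positions.
lin_pred = LinearNDInterpolator(train_pos, flat)(query_pos)
# Linear interpolation is undefined outside the convex hull of the training
# positions; fall back to nearest neighbor there.
lin_pred = np.where(np.isnan(lin_pred), nn_pred, lin_pred)

nn_pred = nn_pred.reshape(-1, 256, 64)
lin_pred = lin_pred.reshape(-1, 256, 64)
```

Note that both baselines must keep every training response available at inference time, which is the source of the storage gap discussed next.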
Despite requiring orders of magnitude more storage at inference (up to tens of gigabytes, compared to tens of megabytes for our method), these interpolation-based memorization approaches perform significantly worse when evaluated at unobserved locations. Furthermore, the nearest-neighbor baseline does not allow for smooth changes when provided with sparse training samples, while linear baselines are well understood to produce audible artifacts [4]. The qualitative results referenced in **1.1** demonstrate that our proposed framework can capture the acoustic effects at continuous locations in the scene. We will update our website with audible samples for these baselines, and run a human study to quantify the performance of these different methods.

### 1.4 Learning a magnitude-only STFT representation

Three reviewers asked us to clarify how the time-domain signal can be recovered given that we only learn the magnitude of the STFT. This is indeed not obvious, and we will better clarify how acoustic rendering is achieved with our network.

* Because the short-time Fourier transform (STFT) is computed with overlapping windows, the phase information can be recovered by exploiting the redundancy of the magnitude representation.
* In our paper we employ the widely used iterative Griffin-Lim algorithm [5] to reconstruct the phase from the magnitude of the STFT, and thereby infer the time-domain impulse response.
* Once we infer the impulse response, we render the sound by convolving the source signal with it.

Most prior works on spatial impulse response modeling do not model phase information. During synthesis, learned approaches [6], fixed-position approaches [7], and approaches that allow for listener/emitter movement [3] use random phase with the magnitude STFT of the impulse, random-phase filters, and minimum-phase filters, respectively. In the context of spatial impulse responses, the time delay provides the more important spatial cues at lower frequencies (below ~2 kHz) [8, 9], while at higher frequencies the phase is generally not perceptible [10]. We capture the time delay by using the STFT representation, which encodes time-localized frequency information. We will update our paper to include these details. Our framework is also compatible with jointly learning phase and magnitude, and we will provide results in a revised version of our paper.

## 2. Additional Results (pre-revision)

### 2.1 Time-domain metric of our methods

Reviewer V59y asked whether MSE on the STFT representation alone is sufficiently informative. While several past works on learned audio modeling have used MSE/L1 on the STFT to train or evaluate their models [11, 12], we agree that we can better measure the quality of the generated impulse responses. Here we present results using the reverberation time (T60), a widely used metric for characterizing room impulse responses that has been used in prior work [13, 14]. The table below shows the average percentage difference in T60 between generated and ground-truth impulse responses; the time-domain signals are recovered using Griffin-Lim.
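Concretely, one standard way to estimate T60 from a recovered time-domain impulse response is Schroeder backward integration; a minimal sketch under that assumption follows (our exact evaluation code, including the decay range used for the fit, may differ):

```python
import numpy as np

def estimate_t60(ir: np.ndarray, sr: int) -> float:
    """Reverberation time of an impulse response via Schroeder backward integration."""
    edc = np.cumsum(ir[::-1] ** 2)[::-1]                    # energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)          # normalize to 0 dB at t = 0
    t = np.arange(len(ir)) / sr
    i5 = np.argmax(edc_db <= -5.0)                          # first sample below -5 dB
    i35 = np.argmax(edc_db <= -35.0)                        # first sample below -35 dB
    slope = (edc_db[i35] - edc_db[i5]) / (t[i35] - t[i5])   # decay rate in dB/s
    return -60.0 / slope                                    # time to decay by 60 dB

# Percentage difference reported in the table below:
# 100 * abs(estimate_t60(pred_ir, sr) - estimate_t60(gt_ir, sr)) / estimate_t60(gt_ir, sr)
```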
| Model | Large 1 | Large 2 | Medium 1 | Medium 2 | Small 1 | Small 2 | Mean |
|-------------------------|---------|---------|----------|----------|---------|---------|------|
| NAF (Dual local feat)   | 4.31    | 4.88    | 5.00     | 4.20     | 6.68    | 3.80    | 4.81 |
| NAF (Shared local feat) | 4.16    | 5.24    | 3.80     | 3.88     | 6.75    | 3.77    | 4.60 |

### 2.2 Regularization of the NeRF latent

Two reviewers asked us to provide details on the NAF+NeRF joint learning experiment. Here we present results where an L2 penalty with weight 1e-5 is applied to the NeRF grid to encourage a smooth latent under RGB-only supervision. In this experiment we train with 75 images.

| Large 1 | PSNR | MSE (1e-3) | Large 2 | PSNR | MSE (1e-3) |
|---|---|---|---|---|---|
| NeRF + L2 reg | 22.69 | 6.956 | NeRF + L2 reg | 24.86 | 7.128 |
| NeRF | 25.41 | 6.618 | NeRF | 25.70 | 6.921 |

## 3. Planned Revisions

### 3.1 Additional experiments and clarification

1. We will add the time-domain metric most often used to characterize room impulse responses (T60).
2. We will implement a network that jointly learns the magnitude and phase of the impulse response, and analyze its performance using spectrogram and time-domain metrics.
3. We will add a linear-probe decoding experiment to demonstrate that the learned grid features contain spatially useful information.
4. We will add additional qualitative results for nearest-neighbor and linear impulse interpolation to our website.
5. We will carefully revise the paper, including more technical details of our framework, specific information on how sound is rendered from the magnitude, a description of model limitations, content adjustments, and grammar checks.
6. We will add a human study comparing our method against the baseline approaches.

### 3.2 Reproducibility

We will provide a link to a copy of our code during the review period.

### Conclusion

We thank the reviewers for their helpful feedback and suggestions for additional evaluation, which will make the paper substantially stronger. We look forward to additional discussions.

[1] Mildenhall, Ben, et al. "Nerf: Representing scenes as neural radiance fields for view synthesis." ECCV (2020).

[2] Sitzmann, Vincent, Michael Zollhöfer, and Gordon Wetzstein. "Scene representation networks: Continuous 3d-structure-aware neural scene representations." NeurIPS (2019).

[3] Raghuvanshi, Nikunj, and John Snyder. "Parametric directional coding for precomputed sound propagation." SIGGRAPH (2018).

[4] Raghuvanshi, Nikunj, et al. "Precomputed wave simulation for real-time sound propagation of dynamic sources in complex scenes." SIGGRAPH (2010).

[5] Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE ASSP (1984).

[6] Singh, Nikhil, et al. "Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis." ICCV (2021).

[7] Pulkki, Ville. "Applications of directional audio coding in audio." ICA (2007).

[8] Chaitanya, Chakravarty R. Alla, et al. "Directional sources and listeners in interactive sound propagation using reciprocal wave field coding." SIGGRAPH (2020).

[9] Shinn-Cunningham, Barbara G., Scott Santarelli, and Norbert Kopco. "Tori of confusion: Binaural localization cues for sources within reach of a listener." JASA (2000).

[10] Oxenham, Andrew J. "How we hear: The perception and neural coding of sound." Annual Review of Psychology (2018).

[11] Shen, Jonathan, et al. "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions." ICASSP (2018).

[12] Défossez, Alexandre, et al. "Sing: Symbol-to-instrument neural generator." NeurIPS (2018).

[13] Singh, Nikhil, et al. "Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis." ICCV (2021).

[14] Tang, Zhenyu, et al. "Scene-aware audio rendering via deep acoustic analysis." TVCG (2020).

## Reviewer doFB

We thank Reviewer doFB for the detailed and constructive review. We address the specific comments below, and will include additional details in a revision.

> **Q1) How one can listen to the predicted impulse response when the estimated values are only the magnitude parts.**

We provided qualitative audio results on the website included in the paper, reproduced here: https://sites.google.com/view/nafs-iclr-2022

We demonstrate that our model can continuously capture the acoustic effects (reverberation, directionality, portaling due to doors) of a 3D scene after training on sparse data.

To compute the time-domain impulse response from the magnitude alone, we utilize the redundancy present in the STFT representation to recover the phase:

1. Given the magnitude-only STFT representation, we apply the widely used iterative Griffin-Lim algorithm [1] to reconstruct the phase. After reconstructing the phase, we perform an inverse STFT to obtain the time-domain (waveform) impulse response.
2. After computing the time-domain impulse response, we convolve it with an audio sample to render the final result.

This approach of modeling a magnitude-only STFT and inferring the phase with Griffin-Lim is also used in other recent work on learned audio generation [2, 3, 4]. We will include these details in an upcoming revision of the paper.

> **Q2) How the visualizations are done in Figure 1 and 4.**

For a given emitter position, we iterate over all listener positions in a scene and infer the STFT of the impulse response. Given this real-valued magnitude, we sum across all frequency bands up to a maximum time limit, and take the log to improve contrast for visualization. This approach is also used in the SoundSpaces paper for their loudness visualization [5].

> **Q3) Performance on unobserved locations?**

We agree that the performance of NAFs at unobserved listener/emitter locations is important.

* **[Evaluation on unobserved locations]** We would like to clarify that our evaluation in Table 1 and Figure 5 is in fact performed precisely on unseen combinations of emitter/listener locations. This demonstrates that our approach of leveraging a learned spatial representation generalizes better than interpolation baselines and MLP approaches, even when provided with very sparse training data.
* **[Generalization with learnable geometric attributes]** As with implicit visual learning approaches (NeRF, SRN, etc.), our model does, however, require *some* observations of a room to operate in it. In vision, one can leverage multiview consistency and learn a dense representation using a photometric loss. We cannot assume multiview consistency in sound propagation. Instead, we leverage the fact that geometry local to the listener/emitter strongly influences the anisotropic sound propagation [6]. Our framework learns a spatial representation that captures these local attributes, which in turn allows us to generalize to continuous locations from sparse training data. The learned features are visualized in Figure 6.
> **Q4) Details of the sinusoidal encoding.**

We note that our initial submission did include details on the sinusoidal encoding on page 11 of the appendix; we can move this information to the main paper. We use 10 frequencies linearly spaced between 1 Hz and 100 Hz / 200 Hz for the position / time-frequency inputs, respectively. We did not observe performance improvements from further increasing the maximum frequency. Both sine and cosine functions are used, following [7].
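A minimal sketch of a sinusoidal encoding of this form, with 10 linearly spaced frequencies and both sine and cosine terms (the 2π convention and the example frequency range are illustrative and may differ from our exact implementation):

```python
import torch

def sinusoidal_encoding(x: torch.Tensor, f_min: float, f_max: float, num_freqs: int = 10) -> torch.Tensor:
    """Encode each input dimension with sin/cos terms at linearly spaced frequencies."""
    freqs = torch.linspace(f_min, f_max, num_freqs)              # e.g., 1 Hz to 100 Hz for positions
    angles = 2.0 * torch.pi * x.unsqueeze(-1) * freqs            # (..., dims, num_freqs)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1).flatten(-2)

pos_enc = sinusoidal_encoding(torch.rand(4, 3), f_min=1.0, f_max=100.0)   # -> shape (4, 3 * 2 * 10)
```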
> **Q5) Details for reproducibility.**

We will provide a copy of the code during the review period.

We sincerely appreciate your comments. Please feel free to let us know if you have further questions.

[1] Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE ASSP (1984).

[2] Du, Yilun, et al. "Learning Signal-Agnostic Implicit Manifolds." NeurIPS (2021).

[3] Wang, Yuxuan, et al. "Tacotron: Towards end-to-end speech synthesis." Interspeech (2017).

[4] Ren, Yi, et al. "Almost unsupervised text to speech and automatic speech recognition." ICML (2019).

[5] Chen, Changan, et al. "Soundspaces: Audio-visual navigation in 3d environments." ECCV (2020).

[6] Raghuvanshi, Nikunj, and John Snyder. "Parametric directional coding for precomputed sound propagation." SIGGRAPH (2018).

[7] Mildenhall, Ben, et al. "Nerf: Representing scenes as neural radiance fields for view synthesis." ECCV (2020).

## Reviewer V59y

We are strongly encouraged by your assessment that our work is the first on this topic and could potentially be a big deal. We address the specific comments below and **refer to the general response** for results. We will update the paper with the additional experiments and details.

First, we want to note that we already provided qualitative audio results on the first page of our paper: https://sites.google.com/view/nafs-iclr-2022

We observe that our system can continuously model the directionality, loudness decay, and portaling effects of doors and windows as an agent moves through a scene. In an upcoming revision we will highlight this link and provide additional qualitative results for the linear and nearest-neighbor baselines.

> **Q1) Why use an STFT representation?**

* **[Time domain modeling]** We acknowledge that the most straightforward approach would be to model the signal directly in the time domain. However, in initial experiments in the time domain, the models struggled to converge. Other papers using implicit networks for audio representations have likewise chosen to model either the log-magnitude STFT or the full magnitude-phase STFT [1, 2].
* **[Fourier encoding and implicit networks]** Empirically, when modeling in the time domain, we observed that the maximum frequency the network can represent is effectively upper-bounded by the maximum frequency of the Fourier encoding. However, as demonstrated by [3], using higher frequencies in the encoding introduces noise into the representation. By factorizing the time-domain information into a time-localized, frequency-dependent representation, we obtain a representation that is easier to model. The choice of whether to model phase in the STFT is task specific [4, 5, 6], and does not change the smoother nature of the STFT compared to the time-domain representation.

We will include results on time-domain modeling in an upcoming revision.

> **Q2) Reconstruction of RIR without phase?**

This step indeed needs to be better described in our paper, and we thank the reviewer for pointing this out. We model log(abs(STFT)), i.e., the real-valued log-magnitude of the STFT, and leverage the temporal redundancy inherent in an STFT to recover the phase:

* From the magnitude-only STFT representation, we use the widely used iterative Griffin-Lim algorithm to reconstruct the phase.
* Once we have the phase, we compute the inverse STFT to recover the time-domain (waveform) impulse response.

Recovering the phase with Griffin-Lim while modeling only the magnitude has been used in other recent implicit audio learning work [1] and in non-implicit sound synthesis tasks [4, 5]. Other learned and traditional approaches for spatial audio do not model phase; they use random phase with the magnitude STFT [7], random-phase filters [8], or minimum-phase filters [9]. For human perception of *impulse responses*, the onset time provides the more important spatial cues at low frequencies (below ~2 kHz) [10, 11], while at higher frequencies the phase is not perceptible [12]. Because we model the STFT, our method captures the onset time. We will include additional experiments on joint STFT magnitude and phase modeling in an upcoming revision of our paper.
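To make this pipeline concrete, a minimal sketch, assuming `log_mag` is a log-magnitude STFT predicted by the network and `source` a dry source signal; librosa's Griffin-Lim implementation stands in for ours, and the hop length shown is illustrative rather than the value used in the paper:

```python
import numpy as np
import librosa
from scipy.signal import fftconvolve

def render(log_mag: np.ndarray, source: np.ndarray) -> np.ndarray:
    mag = np.exp(log_mag)                                     # undo the log compression of the magnitude
    ir = librosa.griffinlim(mag, n_iter=100, hop_length=128)  # iterative phase recovery + inverse STFT
    return fftconvolve(source, ir, mode="full")               # LTI rendering: convolve the source with the IR
```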
> **Q3) Other evaluation metrics?**

We agree that additional evaluation metrics are valuable. We note that MSE/L1 losses on the STFT magnitude have been used for both evaluating and training sound generation models [13, 14]. Furthermore, the Griffin-Lim algorithm iteratively infers the phase from the magnitude by minimizing the MSE between the inferred and provided magnitude STFTs. To better characterize the performance of our models, we also report reverberation time (T60), a metric used to characterize impulse response modeling in other recent work [7]. Please refer to Section **2.1** of our general response for the T60 errors of our methods. We will provide additional T60 measurements in an upcoming revision.

> **Q4) Do NAFs help NeRF only because of the smooth representation?**

We agree that low-frequency regularization could be one possible factor in the visual results. To demonstrate that the improved visual results do not come solely from low-frequency regularization, we investigate whether applying an L2 regularizer to encourage a smooth latent space improves the visual quality at unseen locations. We set lambda_reg for the grid to 1e-5, as this yields approximately the same gradient magnitude from the regularization term as from the reconstruction term. The result is presented in Section **2.2** of the general response. It shows that the improved generalization does not come solely from a smooth grid latent, but that our NAFs learn to encode spatially relevant information. We will provide additional experiments on linear-probe decoding in an upcoming version of our paper.

We deeply appreciate your feedback, and will incorporate your suggestions in an upcoming revision of our paper. Please let us know if you have any other questions.

[1] Du, Yilun, et al. "Learning Signal-Agnostic Implicit Manifolds." NeurIPS (2021).

[2] Gao, Ruohan, et al. "ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and Tactile Representations." CoRL (2021).

[3] Tancik, Matthew, et al. "Fourier features let networks learn high frequency functions in low dimensional domains." NeurIPS (2020).

[4] Wang, Yuxuan, et al. "Tacotron: Towards end-to-end speech synthesis." Interspeech (2017).

[5] Ren, Yi, et al. "Almost unsupervised text to speech and automatic speech recognition." ICML (2019).

[6] Engel, Jesse, et al. "Gansynth: Adversarial neural audio synthesis." ICLR (2019).

[7] Singh, Nikhil, et al. "Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis." ICCV (2021).

[8] Pulkki, Ville. "Applications of directional audio coding in audio." ICA (2007).

[9] Raghuvanshi, Nikunj, and John Snyder. "Parametric directional coding for precomputed sound propagation." SIGGRAPH (2018).

[10] Chaitanya, Chakravarty R. Alla, et al. "Directional sources and listeners in interactive sound propagation using reciprocal wave field coding." SIGGRAPH (2020).

[11] Shinn-Cunningham, Barbara G., Scott Santarelli, and Norbert Kopco. "Tori of confusion: Binaural localization cues for sources within reach of a listener." JASA (2000).

[12] Oxenham, Andrew J. "How we hear: The perception and neural coding of sound." Annual Review of Psychology (2018).

[13] Shen, Jonathan, et al. "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions." ICASSP (2018).

[14] Défossez, Alexandre, et al. "Sing: Symbol-to-instrument neural generator." NeurIPS (2018).

## Reviewer HhWA

We appreciate the reviewer's assessment that the study is new and interesting, and thank you for the suggestions. Below are our responses to the specific concerns.

> **Q1) Comparison to NeRF and ray tracing.**

We agree that it is important to highlight the difference between our approach and NeRF. Our work relies on **fundamentally different assumptions** than visual-domain modeling, and is the first to attempt modeling spatial impulse responses with a neural network.

* **[Scope of our work]** To clarify, like the corresponding visual models (NeRF, SRN), our network is fit to a *specific* scene, and seeks to generalize to unseen emitter/listener locations (novel views in NeRF) from sparse training samples in the *same* scene [1, 2]. Learning implicit representations that generalize to unseen scenes is an open question in both vision and acoustics, and we look forward to exploring it in future work.
* **[Generalizing to unseen locations]** NeRF leverages the multiview-consistent nature of the visual world to learn a dense scene from sparse samples using a photometric loss. However, multiview consistency cannot be assumed in spatial acoustic modeling. Because anisotropic reflections are strongly affected by local geometry [3], we instead propose to learn the necessary local geometric attributes of a scene. We demonstrate in Figure 5 that by learning this geometric feature, we can generalize to unobserved locations significantly better than a simple MLP.
* **[Acoustic rendering]** Sound propagation can be modeled as a linear time-invariant (LTI) system. In such a system, **all propagation paths and reflections** are captured by the impulse response [4]. To render the sound at an arbitrary location, we convolve the predicted impulse response with the original signal in the time domain. Our approach is fully differentiable.

In the revision, we will further clarify the differences between our model and NeRF.

> **Q2) Memorization and network generalization.**

* **[Memorization performs poorly]** We present **unambiguous evidence** that our model's performance cannot be explained by memorization. In our paper we compare against strong nearest-neighbor and linear interpolation baselines at unseen locations. At test time these methods require orders of magnitude more storage (tens of gigabytes, compared to tens of megabytes for our model). Despite this test-time data advantage, the interpolation baselines perform worse than our method.
* **[Contribution of our architecture]** We also perform ablation studies in Table 1 and Figure 5, and demonstrate that alternative architectures which do not share or do not use geometric features generalize worse. These experiments show that our architecture meaningfully contributes to the generalization performance, as all methods are trained on the same dataset.

In the revision, we will highlight the novelty of our work and the advantage our approach has over strong interpolation baselines.

> **Q3) Implicit representation versus dataset.**

An implicit representation allows us to infer the sound for emitters and listeners placed at dense locations not present in the training data, and is very compact compared to the full dataset. The advantages of our model over the dataset are threefold:

* **[Compactness]** The model is much more compact than the dataset: only tens of megabytes, compared to tens of gigabytes for the dataset.
* **[Continuous representation]** The dataset is not densely sampled, and we wish to infer the acoustic response at locations not in the dataset. By learning an implicit function, our network learns the properties of sound propagation and generalizes better than interpolation baselines.
* **[Joint visual learning]** Our model helps with downstream tasks that cannot be easily achieved with the original dataset.

We will provide additional qualitative results using these interpolation baselines, in addition to the current results on our website: https://sites.google.com/view/nafs-iclr-2022

> **Q4) Grid features provided to the network.**

The reviewer may have misunderstood our approach. We **do not** provide grid features from the dataset directly to the network. Our grid features are learned from the impulse responses themselves. During training, the network is only provided with the emitter/listener location and orientation, and is supervised with the log-magnitude STFT of the impulse response.

> **Q5) Additional visualizations.**

As shown in Figure A4, the original dataset is sparse. We will provide additional loudness visualizations using nearest-neighbor and linear interpolation in an upcoming version of our paper.

> **Q6) Parameterization of the local feature grid.**

We included details on our local grid in Section B of the appendix, and further describe them here. Briefly, we initialize a grid with 32x32 resolution and 64 feature maps. The grid elements are initialized i.i.d. from a normal distribution with zero mean and 1/sqrt(64) variance. We normalize each scene coordinate to [-1, 1] using the axis-aligned bounding box of the scene, and then query the grid using bilinear sampling; the sampling process is differentiable. The grid is learned using STFT magnitude supervision. We will include these details in the revision.
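A minimal PyTorch sketch of a grid of this form, following the description above (module and variable names are illustrative, not those of our code release):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalFeatureGrid(nn.Module):
    def __init__(self, channels: int = 64, resolution: int = 32):
        super().__init__()
        std = (1.0 / channels ** 0.5) ** 0.5   # std chosen so the variance is 1/sqrt(64), as described above
        self.grid = nn.Parameter(torch.randn(1, channels, resolution, resolution) * std)

    def forward(self, xy: torch.Tensor, bbox_min: torch.Tensor, bbox_max: torch.Tensor) -> torch.Tensor:
        # Normalize scene coordinates to [-1, 1] using the scene's axis-aligned bounding box.
        norm = 2.0 * (xy - bbox_min) / (bbox_max - bbox_min) - 1.0
        norm = norm.view(1, -1, 1, 2)          # (1, num_points, 1, 2) layout expected by grid_sample
        # Differentiable bilinear query of the learned grid.
        feat = F.grid_sample(self.grid, norm, mode="bilinear", align_corners=True)
        return feat.squeeze(0).squeeze(-1).t() # (num_points, channels)
```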
> **Q7) RGB only baseline.**

We utilize a standard two-stage NeRF (hierarchical sampling). Both networks take a scene coordinate as input and output an RGB value and a density. The training dataset consists of pairs of view matrices and RGB images. To ensure fairness with the NAF + NeRF experiment, we augment the "fine" network in NeRF with a learnable grid. We use 64 coarse samples and 128 fine samples per ray, and sample 1024 rays per batch. Supervision is provided via a photometric (MSE) loss.

> **Q7) Joint training of NAF and NeRF.**

For the joint training baseline, in addition to the two NeRF networks, we also jointly learn a NAF. The NAF maps listener/emitter coordinates and orientations to the log-magnitude STFT. We condition the NAF on the same grid as the second-stage (fine) NeRF network. During training, the loss consists of both the photometric NeRF loss and the MSE spectrogram loss, so the grid is jointly optimized by the visual and acoustic losses.

> **Q8) Notation and details.**

* We will clarify the definitions of theta, k, and x.
* v is a single scalar value.

We will clarify these points in a revision of our paper, and note the dimensionality of our variables.

Thank you again for your advice and feedback. We will incorporate the suggestions into the paper.

[1] Mildenhall, Ben, et al. "Nerf: Representing scenes as neural radiance fields for view synthesis." ECCV (2020).

[2] Sitzmann, Vincent, Michael Zollhöfer, and Gordon Wetzstein. "Scene representation networks: Continuous 3d-structure-aware neural scene representations." NeurIPS (2019).

[3] Raghuvanshi, Nikunj, and John Snyder. "Parametric directional coding for precomputed sound propagation." SIGGRAPH (2018).

[4] Pierce, Allan D. "Acoustics: An introduction to its physical principles and applications." Springer (2019).

## Reviewer U1r6

We are encouraged by your assessment that this is an important and interesting question, and appreciate the detailed feedback. We will incorporate all suggestions into our paper. Below are our responses to the specific comments.

> **Q1) Data and generalization.**

* **[Scope of our work]** Our work has the same scope as the corresponding visual models (NeRF, SRN, etc.), which fit a network to an *individual* scene and seek to generalize to novel views (in our case, novel listener/emitter positions) from sparse training data in the *same* scene [1, 2]. Modeling completely unobserved locations with implicit networks is an open question in both vision and acoustics.
* **[Generalizing in spatial audio]** In vision, one can leverage the multiview-consistent nature of the scene to learn a dense representation using a photometric loss. However, we cannot make the same multiview-consistency assumption when learning acoustic scenes. Instead, we leverage the fact that anisotropic reflections are strongly affected by local geometry [3], and propose to generalize by explicitly learning the local geometric attributes.
* **[Contribution of our work]** Our work represents the first step towards representing spatial acoustics with a neural network. We show that our system is significantly better than nearest-neighbor and linear interpolation baselines. We look forward to exploring generalization across scenes in future work.
> **Q2) Parameterization of the grid.**

Thank you. We provided details about the grid in Section B of the appendix, and provide additional details here. Briefly, we initialize a grid with 32x32 resolution and 64 feature maps. Each element of the grid is initialized independently from a Gaussian with zero mean and 1/sqrt(64) variance. Each scene coordinate is normalized to [-1, 1] using the axis-aligned bounding box of the scene, and the features are queried at these normalized coordinates with bilinear sampling. We will provide the code for our model later in the review period.

> **Q3) Blending left/right ear information.**

The idea is very interesting, and we will provide qualitative results in an upcoming revision of our paper. It should be noted that our system is independent of the choice of panning method. For our empirical results we use linear panning, but HRTF-based panning can also be used.

> **Q4) STFT and phase.**

Our model outputs the magnitude. To recover the phase from the magnitude, we utilize the redundancy present in the STFT representation:

1. Given the magnitude-only STFT representation, we apply the widely used iterative Griffin-Lim algorithm to reconstruct the phase [4]. After reconstructing the phase, we perform an inverse STFT to obtain the time-domain (waveform) impulse response.
2. After computing the time-domain impulse response, we convolve it with an audio sample to render the final result.

This approach of modeling only the magnitude and recovering the phase with Griffin-Lim is also used in other recent papers on sound generation [5, 6, 7]. In practice, many learned and traditional models for spatial audio do not seek to represent the phase; they use random phase with the magnitude STFT [8], random-phase filters [9], or minimum-phase filters [3]. We will include these details in an upcoming revision of our paper.

> **Q5) Temporal zero padding of the data.**

We pad the data because not all impulse responses within a single scene have the same length; a highly reverberant room will generally have longer impulse responses. Since the training data is very sparse, the ground-truth length of the impulse is not defined for most locations in a scene. To avoid discarding information, we model each impulse response up to the maximum impulse length in the scene. However, the vast majority of impulses are significantly shorter and provide no useful information in the padded portion. We therefore stochastically pad the data during training, which allows the network to focus on modeling the energetic early stages of the impulse. Because we learn an implicit function, not all data samples have to be the same length.

> **Q6) Selection of the test set.**

We select the test set randomly (sketched below), so that our network can observe all regions of a scene. Our goal is to generalize to continuous locations in a scene given only sparse training samples. In Figure 5, we observe that by learning a grid feature shared between the emitter and listener, we perform consistently better than an MLP even in very sparse settings. It should be noted that the corresponding implicit visual approaches (NeRF, SRN, etc.) are also fit to a single scene, and require parts of a scene to be observed to allow for inference.
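For illustration, a minimal sketch of such a random hold-out over emitter/listener pairs, so that every tested pair is an unseen combination (the counts and split ratio here are placeholders, not the values used in our experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
num_emitters, num_listeners = 100, 500                        # placeholder counts
pairs = [(em, li) for em in range(num_emitters) for li in range(num_listeners)]
rng.shuffle(pairs)

n_test = int(0.1 * len(pairs))                                # e.g., hold out 10% of the pairs
test_pairs, train_pairs = pairs[:n_test], pairs[n_test:]      # test combinations never appear in training
```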
> **Q7) Clarification of equation 8.**

Equation (8) is used when our NAF is present. The modified nearest-neighbor baseline instead uses the maximum magnitude and is not learned: the sample with the highest average sound pressure level is taken to be the location of the emitter. We will clarify this setup in an upcoming revision of the paper.

> **Q8) Notation and clarification.**

* We will state the dimensionality of each component.
* We will clarify the notation for the intensity and the NAFs.

Thank you for the suggestions. We will include these details and adjust our notation to improve the clarity of the paper. We genuinely appreciate your advice and suggestions, and will incorporate your feedback in a revision of our paper.

[1] Mildenhall, Ben, et al. "Nerf: Representing scenes as neural radiance fields for view synthesis." ECCV (2020).

[2] Sitzmann, Vincent, Michael Zollhöfer, and Gordon Wetzstein. "Scene representation networks: Continuous 3d-structure-aware neural scene representations." NeurIPS (2019).

[3] Raghuvanshi, Nikunj, and John Snyder. "Parametric directional coding for precomputed sound propagation." SIGGRAPH (2018).

[4] Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE ASSP (1984).

[5] Du, Yilun, et al. "Learning Signal-Agnostic Implicit Manifolds." NeurIPS (2021).

[6] Wang, Yuxuan, et al. "Tacotron: Towards end-to-end speech synthesis." Interspeech (2017).

[7] Ren, Yi, et al. "Almost unsupervised text to speech and automatic speech recognition." ICML (2019).

[8] Singh, Nikhil, et al. "Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis." ICCV (2021).

[9] Pulkki, Ville. "Applications of directional audio coding in audio." ICA (2007).