# NAFs NeurIPS Rebuttal

## Summary of our response and discussion

We genuinely thank all reviewers for their constructive comments, which have contributed to the improvements in our paper. We sincerely appreciate the positive 8-8-8-6 evaluation from reviewers u3Tz, e4mi, ossn, and jrJA. Here is a summary of our response.

### Contributions

We would like to first emphasize the contributions of this paper:

* We propose Neural Acoustic Fields (NAFs), which render the sound for arbitrary emitter and listener positions in a scene. NAFs are represented as an implicit function that outputs the log-magnitude and phase information of a given impulse response.
* We demonstrate that by conditioning NAFs on a learnable spatial grid of features, we can improve the generalizability of our architecture.
* We show that NAFs can learn geometric structure of a scene that is useful for downstream tasks.

### Additional experiments

* **[Interpolation baseline]** To address the concern of reviewer jrJA, we add additional experimental results using a kernel ridge regression baseline.
* **[Additional RIR metric]** To address the concern of reviewer jrJA, we add quantitative results on DRR for our impulse responses.
* **[Spatial audio metric]** To address the concerns of u3Tz, we add an additional evaluation of the interaural cross correlation coefficient for our network and baseline outputs. We show that our network better preserves spatial cues in the binaural impulse response than the baselines.

### Writing

We thank all reviewers for their suggestions regarding our writing and clarity. We believe that the clarifications suggested by the reviewers will improve the communication of our work.

* We provide additional details about our network architecture [jrJA, u3Tz, e4mi], baseline setup [u3Tz, e4mi], and prior work [jrJA, u3Tz].
* We clarify that our NAFs are learned in the time-frequency STFT domain, and provide additional details about our phase representation [u3Tz].
* We have provided additional details about our dataset [u3Tz].

We are deeply grateful to the reviewers for their helpful suggestions, which have helped improve our paper significantly. The additional experiments and clarifications will be reflected in the final version as well.

Best,
Authors

## General response (pre-revision)

We are grateful to all reviewers for their constructive comments, which we agree will significantly improve the communication of our work. We are very encouraged by the reviewers’ evaluation of the significance and novelty of this work. All four reviewers find our work on Neural Acoustic Fields (NAFs) to be novel (“This is a neat idea” (jrJA), “the general idea of NAFs is novel, interesting, and potentially impactful” (u3Tz), “the first, to my knowledge” (e4mi), “method appears to work well” (ossn)).

### 1. General clarifications

#### 1.1 Network details and reproducibility

Our method is fully reproducible. We have included a folder of our code, which contains hyperparameters, network architecture, and baselines, as part of the submitted supplementary material. We hope the code will help the community reproduce our work and inspire later studies.

#### 1.2 Differences from prior work

We would first like to clarify that our work is concurrent with [1]. NAFs differentiate themselves by learning a mapping for all possible emitter and listener locations in a scene.
This is **fundamentally different** from prior work, which in practice is learned with a non-moving emitter or listener [1, 2], or uses handcrafted parameterizations of the sound field [3]. We demonstrate that by augmenting the network with geometric features shared between the emitter and listener, we can achieve a model that outperforms networks using no geometric features or non-shared geometric features. We show that NAFs are significantly more compact than traditional audio coding baselines, and achieve higher quality when evaluated on T60, spectral, DRR, or IACC error. We further show that the audio representations learned by NAFs are informative of scene structure, making them a useful non-visual unsupervised scene representation.

### 2. Additional Experiments

The reviewers also suggest that additional metrics and baselines will make the paper stronger, highlight its strengths, clarify potential limitations, and outline directions for future work. We agree, and have augmented our revision with additional quantitative results. We have added an additional baseline, results for the direct-to-reverberant ratio (DRR) to better characterize the early components, and results for the interaural cross correlation coefficient (IACC) to characterize the spatial cues. We provide these metrics here, and will include them in the revision.

#### 2.1 Interpolation baseline

Here we compare against the method proposed in [4] on the MeshRIR dataset. "Constrained-Orig" uses the 500 Hz low-pass filter as used by the authors, while "Constrained-Unfiltered" is our modification which uses the unfiltered impulse response.

|                        | Spectral  | T60       | DRR       |
|------------------------|-----------|-----------|-----------|
| Constrained-Orig       | 2.539     | 8.192     | 2.497     |
| Constrained-Unfiltered | 1.370     | 6.294     | 3.702     |
| NAF (Dual)             | **0.403** | 4.201     | 0.992     |
| NAF (Shared)           | **0.403** | **4.191** | **0.972** |

#### 2.2 DRR metric

The direct-to-reverberant ratio measures the ratio of the energy coming from the direct sound to the energy in the reverberant tail. We find that NAFs have lower DRR error than baseline methods.

|              | Large 1   | Large 2   | Medium 1  | Medium 2  | Small 1   | Small 2   | MeshRIR   | Mean      |
|--------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| AAC-nearest  | 1.748     | 2.424     | 1.344     | 1.343     | 1.213     | 1.108     | 1.286     | 1.495     |
| AAC-linear   | 1.797     | 2.147     | 1.457     | 1.458     | 1.117     | 1.226     | 1.222     | 1.490     |
| Opus-nearest | 2.931     | 3.275     | 2.756     | 2.769     | 3.548     | 3.255     | 2.698     | 3.033     |
| Opus-linear  | 2.645     | 2.771     | 2.381     | 2.370     | 3.266     | 2.882     | 2.529     | 2.692     |
| DSP          | 3.559     | 4.421     | 4.727     | 4.805     | 5.622     | 6.723     | N/A       | 4.976     |
| NAF (Dual)   | 1.645     | 1.830     | 1.113     | **1.082** | **0.796** | **0.799** | 0.992     | 1.179     |
| NAF (Shared) | **1.468** | **1.793** | **1.083** | 1.089     | 0.829     | 0.837     | **0.972** | **1.153** |

#### 2.3 IACC metric

The interaural cross correlation coefficient measures spatial localization cues in binaural impulse responses, and is correlated with localization performance in humans. We find that NAFs achieve the lowest IACC error on average; a computation sketch and the results are given below.
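As a rough illustration of how this error can be computed, the sketch below measures the lag (in seconds) at which the normalized interaural cross-correlation of a binaural impulse response peaks, and compares that lag between prediction and ground truth. The ±1 ms lag window and this exact formulation are illustrative assumptions, not a verbatim copy of our evaluation code (which is in the supplementary). The results follow.

```python
import numpy as np

def iacc_peak_lag(h_left, h_right, sr, max_lag_s=1e-3):
    """Lag (seconds) at which the normalized interaural cross-correlation
    of a binaural IR peaks. The +-1 ms window is a common convention and
    an assumption here, not necessarily our exact evaluation setting."""
    norm = np.sqrt(np.sum(h_left ** 2) * np.sum(h_right ** 2)) + 1e-12
    xcorr = np.correlate(h_left, h_right, mode="full") / norm
    center = len(h_right) - 1                      # index of zero lag
    max_lag = int(round(max_lag_s * sr))
    window = xcorr[center - max_lag : center + max_lag + 1]
    return (np.argmax(window) - max_lag) / sr

def iacc_error(pred_lr, gt_lr, sr):
    """Absolute difference of the peak lags; the table reports the mean of
    this quantity per scene, scaled by 1e6."""
    return abs(iacc_peak_lag(*pred_lr, sr) - iacc_peak_lag(*gt_lr, sr))
```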
|              | Large 1   | Large 2   | Medium 1  | Medium 2  | Small 1   | Small 2   | Mean      |
|--------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| AAC-nearest  | 236.8     | 184.2     | 213.7     | 215.3     | 264.8     | 272.5     | 231.2     |
| AAC-linear   | 212.3     | 156.7     | 185.9     | 187.8     | 245.2     | 265.2     | 208.8     |
| Opus-nearest | 73.75     | 45.97     | 71.97     | 74.70     | 103.8     | **67.40** | 72.93     |
| Opus-linear  | 75.56     | 48.32     | 73.38     | 77.33     | 109.2     | 78.10     | 76.98     |
| DSP          | 460.5     | 446.0     | 430.0     | 430.1     | 443.6     | 446.3     | 442.7     |
| NAF (Dual)   | 74.01     | 45.94     | 71.89     | 74.70     | 103.8     | **67.40** | 72.96     |
| NAF (Shared) | **73.68** | **45.90** | **71.52** | **73.58** | **103.6** | **67.40** | **72.62** |

* Mean absolute difference of IACC (unit in seconds; values here are multiplied by 1e6). Lower is better.

### Conclusion

We thank the reviewers for their careful feedback and additional suggestions for evaluation, which will make the paper significantly stronger.

[1] Richard, Alexander, et al. "Deep Impulse Responses: Estimating and Parameterizing Filters with Deep Networks." (2022)
[2] Richard, Alexander, et al. "Neural synthesis of binaural speech from mono audio." (2020)
[3] Chaitanya, Chakravarty R. Alla, et al. "Directional sources and listeners in interactive sound propagation using reciprocal wave field coding." (2020)
[4] Ueno, Natsuki, et al. "Kernel ridge regression with constraint of Helmholtz equation for sound field interpolation." (2018)

## Reviewer jrJA

We are encouraged by your assessment that modeling scene acoustics is an important question and that our approach is a novel one. We thank Reviewer jrJA for the detailed and constructive review. Below are our responses to specific comments. We look forward to further discussion, and are happy to answer any questions.

> **Q1) Comparison to previous sound field models**

We agree it is important to highlight the difference between NAFs and past work. Prior work has proposed both parametric and non-parametric methods to interpolate the sound field. Parametric methods typically only capture perceptually relevant cues, while non-parametric methods seek to estimate the sound field itself. These models typically represent the sound field as a linear composition of spherical or plane wave expansions. Methods similar to [1] typically leverage priors or assumptions about the sound field, such as physical constants, far-field sound sources, or the positions of the receivers. While these assumptions may hold in certain settings, acoustic environments can be complex and deviate from model priors. Unlike these traditional approaches, our NAFs are learned from data. Furthermore, unlike past approaches, which typically estimate a sound field for a fixed source, our NAFs enable arbitrary positioning of both the source and the receiver.

Since there is no public implementation of [1], we provide additional quantitative results using the method described in "Kernel Ridge Regression With Constraint of Helmholtz Equation for Sound Field Interpolation" [2] on the MeshRIR dataset. We use two variants of the model: the first uses their original parameters, which include a 500 Hz low-pass filter, and in the second we modify the model to use the unfiltered RIR. We use their originally proposed regularization value of 0.1.
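For readers unfamiliar with this class of baseline, the sketch below shows the general shape of per-frequency kernel ridge interpolation that we evaluated: a Gram matrix over microphone positions, ridge regularization (λ = 0.1 as above), and prediction at the query position. The zeroth-order spherical Bessel kernel is our reading of the Helmholtz-constrained kernel of [2], and the per-bin loop is illustrative rather than a verbatim reimplementation. The results are below.

```python
import numpy as np
from scipy.special import spherical_jn

SPEED_OF_SOUND = 343.0  # m/s

def krr_interpolate(mic_pos, rir_fft, query_pos, freqs, lam=0.1):
    """Per-frequency kernel ridge regression over measurement positions.

    mic_pos:   (M, 3) microphone positions
    rir_fft:   (M, F) complex RIR spectra measured at those positions
    query_pos: (Q, 3) positions to interpolate to
    freqs:     (F,)   frequencies in Hz
    The kernel j0(k * distance) reflects the Helmholtz constraint of the
    baseline as we understand it; treat this sketch as illustrative."""
    d_mm = np.linalg.norm(mic_pos[:, None] - mic_pos[None, :], axis=-1)    # (M, M)
    d_qm = np.linalg.norm(query_pos[:, None] - mic_pos[None, :], axis=-1)  # (Q, M)
    out = np.zeros((len(query_pos), len(freqs)), dtype=complex)
    eye = np.eye(len(mic_pos))
    for i, f in enumerate(freqs):
        k = 2.0 * np.pi * f / SPEED_OF_SOUND        # wavenumber
        weights = np.linalg.solve(spherical_jn(0, k * d_mm) + lam * eye,
                                  rir_fft[:, i])
        out[:, i] = spherical_jn(0, k * d_qm) @ weights
    return out  # an inverse FFT along the frequency axis gives time-domain RIRs
```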
|                  | Spectral  | T60       | DRR       |
|------------------|-----------|-----------|-----------|
| Ridge-Orig       | 2.539     | 8.192     | 2.497     |
| Ridge-Unfiltered | 1.370     | 6.294     | 3.702     |
| NAF (Dual)       | **0.403** | 4.201     | 0.992     |
| NAF (Shared)     | **0.403** | **4.191** | **0.972** |

We find that our model consistently outperforms this baseline. We will include a discussion of [1, 2] and related methods in our updated revision.

> **Q2) Differences in parameterization from "Deep Impulse Responses (DIRs)"**

We want to first clarify that our work is concurrent with [3]. Both [3] and our work parameterize the impulse response as a continuous implicit function. However, DIRs assume a stationary source or receiver, and in practice they focus on a static receiver with emitters distributed on a sphere. NAFs allow both the source and receiver to move freely within a room, which requires us to model a much larger and more complex set of impulse responses. This is a fundamentally more challenging problem.

An additional difference is our parameterization of the output. NAFs parameterize the output as log-magnitude and instantaneous frequency (phase) [4], while DIRs output a time-domain waveform directly. We experimented with using the representation and MSE training loss proposed in DIRs, and these results are presented in section **H** of the revised supplementary. We observed that while outputting the waveform succeeds when modeling a small subset of the impulse responses, the network would only output an over-smoothed waveform when modeling an entire scene. We experimented with increasing the frequency of the Fourier features, as this has been suggested to improve the ability of the network to model high-frequency data [5]. However, we found that this introduced high-frequency noise into the predicted impulse response. This led us to adopt an STFT-based output representation. Prior work on using implicit networks for audio representations has similarly modeled either the log-magnitude of the STFT or the full magnitude-phase STFT [6, 7].

> **Q3) Results for the direct-to-reverberant ratio**

We agree that the direct-to-reverberant ratio (DRR) is a useful metric for characterizing room impulse responses. Here we present the mean absolute error of the DRR for each method:

|              | Large 1   | Large 2   | Medium 1  | Medium 2  | Small 1   | Small 2   | MeshRIR   | Mean      |
|--------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| AAC-nearest  | 1.748     | 2.424     | 1.344     | 1.343     | 1.213     | 1.108     | 1.286     | 1.495     |
| AAC-linear   | 1.797     | 2.147     | 1.457     | 1.458     | 1.117     | 1.226     | 1.222     | 1.490     |
| Opus-nearest | 2.931     | 3.275     | 2.756     | 2.769     | 3.548     | 3.255     | 2.698     | 3.033     |
| Opus-linear  | 2.645     | 2.771     | 2.381     | 2.370     | 3.266     | 2.882     | 2.529     | 2.692     |
| DSP          | 3.559     | 4.421     | 4.727     | 4.805     | 5.622     | 6.723     | N/A       | 4.976     |
| NAF (Dual)   | 1.645     | 1.830     | 1.113     | **1.082** | **0.796** | **0.799** | 0.992     | 1.179     |
| NAF (Shared) | **1.468** | **1.793** | **1.083** | 1.089     | 0.829     | 0.837     | **0.972** | **1.153** |

* Mean absolute error of the DRR, in dB. Lower is better. Note that the DSP baseline was not implemented for MeshRIR due to the lack of absolute coordinates.
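For reference, the sketch below shows how a DRR of this form can be computed from a single impulse response; the ±2.5 ms window around the direct-path peak is a common convention and is an assumption here, not a statement of our exact evaluation code.

```python
import numpy as np

def drr_db(rir, sr, direct_window_s=2.5e-3):
    """Direct-to-reverberant ratio (dB) of an impulse response.
    The direct part is the +-2.5 ms neighborhood of the strongest sample
    (window size assumed for illustration); everything after that window
    counts as reverberant energy."""
    peak = int(np.argmax(np.abs(rir)))
    half = int(round(direct_window_s * sr))
    lo, hi = max(peak - half, 0), peak + half + 1
    direct = np.sum(rir[lo:hi] ** 2)
    reverberant = np.sum(rir[hi:] ** 2)
    return 10.0 * np.log10(direct / (reverberant + 1e-12))

# The tables report the mean of |drr_db(predicted) - drr_db(ground_truth)|.
```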
We thank the reviewer for the suggestions, and have added additional quantitative comparisons with a sound field interpolation method alongside DRR results. Following your suggestion, we have also reduced the length of section 3.1 in the revision. We will include additional discussion and add these results to the revision.

[1] Antonello, Niccolo, et al. "Room impulse response interpolation using a sparse spatio-temporal representation of the sound field." (2017)
[2] Ueno, Natsuki, et al. "Kernel ridge regression with constraint of Helmholtz equation for sound field interpolation." (2018)
[3] Richard, Alexander, et al. "Deep Impulse Responses: Estimating and Parameterizing Filters with Deep Networks." (2022)
[4] Engel, Jesse, et al. "GANSynth: Adversarial neural audio synthesis." (2019)
[5] Tancik, Matthew, et al. "Fourier features let networks learn high frequency functions in low dimensional domains." (2020)
[6] Gao, Ruohan, et al. "ObjectFolder: A dataset of objects with implicit visual, auditory, and tactile representations." (2021)
[7] Du, Yilun, et al. "Learning signal-agnostic manifolds of neural fields." (2021)

## Reviewer u3Tz

We appreciate your assessment that NAFs are a novel and interesting idea. We thank Reviewer u3Tz for the helpful review. Below are our responses to specific comments.

> **Q1) Evaluation on binaural/spatial rendering**

We agree that binaural cues are important and should be reflected in our evaluations. The interaural cross correlation coefficient (IACC) is a commonly accepted metric for the spatial localization of sound sources from binaural audio [1], and is believed to be predictive of human localization of sound sources [2]. The IACC is computed for each binaural impulse response, and the mean absolute difference between our predicted and ground-truth IACC is taken.

|              | Large 1   | Large 2   | Medium 1  | Medium 2  | Small 1   | Small 2   | Mean      |
|--------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| AAC-nearest  | 236.8     | 184.2     | 213.7     | 215.3     | 264.8     | 272.5     | 231.2     |
| AAC-linear   | 212.3     | 156.7     | 185.9     | 187.8     | 245.2     | 265.2     | 208.8     |
| Opus-nearest | 73.75     | 45.97     | 71.97     | 74.70     | 103.8     | **67.40** | 72.93     |
| Opus-linear  | 75.56     | 48.32     | 73.38     | 77.33     | 109.2     | 78.10     | 76.98     |
| DSP          | 460.5     | 446.0     | 430.0     | 430.1     | 443.6     | 446.3     | 442.7     |
| NAF (Dual)   | 74.01     | 45.94     | 71.89     | 74.70     | 103.8     | **67.40** | 72.96     |
| NAF (Shared) | **73.68** | **45.90** | **71.52** | **73.58** | **103.6** | **67.40** | **72.62** |

* Table 1. Mean absolute difference of IACC (unit in seconds; values here are multiplied by 1e6). Lower is better.

Our method has the lowest IACC error, which indicates that our method is capable of rendering spatial audio. We include this important metric in our revised paper. Thank you for your valuable suggestions.

> **Q2) Technical clarifications**

* **[Cost of ray tracing]** SoundSpaces does not use 200 rays, but instead uses [5000 rays × 200 bounces] for each listener, and [200 rays × 10 bounces] for each emitter. We should clarify that in our paper we mean ray tracing in the context of a learned implicit neural representation of scene structure. Due to the computational cost, current state-of-the-art work on ray tracing in implicit neural representations is limited to a single bounce [3]. We will clarify this in the revision.
* **[Discussion of prior work]** We agree that [4] is an important work in modeling binaural audio. However, the approaches of our model and [4] are different. While [4] seeks to output the binaural audio directly, we output an impulse response which can be applied to mono audio. Secondly, NAFs model the STFT (log-magnitude and instantaneous frequency of the phase), whereas [4] learns the time-domain waveform.
  Finally, in practice [4] is trained and evaluated on data where the listener is fixed and only the emitter can move, while the NAF model is trained and evaluated on listener and emitter pairs which can both move. This requires modeling a much larger set of impulse responses. We did attempt to adapt [4] to our task by using an impulse function as input and the impulse response as supervision. We could not successfully learn the impulse response in this modified setup, probably because their network was not tuned for this task. We will include additional discussion of [4] in our revision.
* **[Instantaneous frequency]** We use the STFT phase instantaneous frequency representation proposed in GANSynth [5], which retains the phase for each frequency band. After the STFT of a waveform is computed, the phase angle within each frequency band is extracted and unwrapped over the 2π boundary, and the finite difference is taken over the time dimension. To get back to the time-domain waveform, we take the cumulative sum of the instantaneous frequency over the time dimension within each frequency band. This is recombined with the magnitude and passed through the inverse STFT, which recovers the exact same waveform as the input. The `get_wave_2` function in `testing/test_utils.py` of our supplementary shows how we recover the waveform; a minimal sketch of this round trip is also given at the end of this response. We thank the reviewer for helping us clarify this point.
* **[Bitrates of baselines]** We believe you may have misinterpreted our results presented in Table 3. To clarify, our NAFs being smaller than the baselines is an ***advantage***, not a sign of NAFs being inferior to the baselines. The Opus and AAC baselines perform worse than NAFs despite being 20x and 40x the size. The code we used for implementing the baselines is in `baselines/make_data_aac.py` and `baselines/make_data_opus.py`, and was provided with the initial submission. We also detail the versions of the encoders we used in section F of our supplementary. We used libopus 1.3.1 and ffmpeg 5.0 native AAC as the respective encoders. libopus was set to use maximum complexity for best quality, music mode for better wideband performance, and constrained variable bitrate mode. AAC was set to use constant bitrate mode. In the revision, we will also mention in the main paper where to find the baseline details.

> **Q3) Other clarifications**

* **[NAFs and the time-frequency domain]** Thank you for pointing this out! We correct this in the revision in L48 and L114.
* **[Time domain output]** Yes, L113 should just indicate the time-domain waveform.
* **[Figure of the shared grid network]** Our intention was to show the shared grid network in Figure 2 of our main paper, since it was the best performing architecture. We will highlight in Figure 2 that we are showing the "shared grid" design, and further include this figure in the supplementary to provide a better comparison.
* **[Dataset details]** We discussed the restricted parameterization of SoundSpaces in section 4.2, and note that it is restricted to a 2D plane. In the revision, we will move the specifics about both SoundSpaces and MeshRIR into section 4.1.

We thank Reviewer u3Tz for providing detailed and thoughtful feedback. Following their suggestions, we have run an evaluation to measure how well our framework preserves binaural cues. We would like to highlight that code is provided in our supplementary for reproducibility. We do note that there may have been a misunderstanding regarding the size of our NAFs, and hope that our clarifications will aid the reviewer in their final evaluation, particularly in light of our additional results.
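As referenced in the instantaneous-frequency point above, here is a minimal sketch of the round trip between a waveform and the (log-magnitude, instantaneous frequency) representation. The window and hop sizes are illustrative assumptions; the `get_wave_2` function in `testing/test_utils.py` of our supplementary is the authoritative implementation.

```python
import numpy as np
import librosa

N_FFT, HOP = 512, 128  # illustrative values, not our exact settings

def wave_to_logmag_if(wave):
    """Waveform -> (log-magnitude, instantaneous frequency) of the STFT."""
    stft = librosa.stft(wave, n_fft=N_FFT, hop_length=HOP)
    log_mag = np.log(np.abs(stft) + 1e-8)
    phase = np.unwrap(np.angle(stft), axis=1)          # unwrap over time per band
    inst_freq = np.diff(phase, axis=1, prepend=0.0)    # finite difference over time
    return log_mag, inst_freq

def logmag_if_to_wave(log_mag, inst_freq):
    """Inverse: cumulative-sum the IF back to phase, recombine, inverse STFT."""
    phase = np.cumsum(inst_freq, axis=1)
    stft = np.exp(log_mag) * np.exp(1j * phase)
    # Recovers the input waveform (up to the epsilon added inside the log).
    return librosa.istft(stft, hop_length=HOP)
```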
[1] Rafaely, Boaz, et al. "Interaural cross correlation in a sound field represented by spherical harmonics." (2010)
[2] Andreopoulou, Areti, et al. "Identification of perceptually relevant methods of inter-aural time difference estimation." (2017)
[3] Srinivasan, Pratul P., et al. "NeRV: Neural reflectance and visibility fields for relighting and view synthesis." (2021)
[4] Richard, Alexander, et al. "Neural synthesis of binaural speech from mono audio." (2020)
[5] Engel, Jesse, et al. "GANSynth: Adversarial neural audio synthesis." (2019)

## Reviewer e4mi

We thank Reviewer e4mi for the helpful and constructive review. We address specific questions below, and will include additional details in a revision.

> **Q1) Network details**

The feature grid contains 64 features at each location, and is initialized from a Gaussian distribution. In the case where individual grids are used for the emitter and listener, the two grids are initialized independently. The network consists of 8 fully connected layers, and leaky ReLU with a slope of 0.1 is used as the activation function. The network has two output neurons, representing log-magnitude and instantaneous frequency (phase). Each fully connected layer uses 512 intermediate feature maps. The network is trained using the Adam optimizer with an initial learning rate of 5e-4, which decays to 5e-5 by the end of training. The code definition for the network is provided as part of our supplementary. We will update our supplementary to better detail our hyperparameters and setup.
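For convenience, a minimal PyTorch sketch of the decoder described above is given here. The input dimensionality (grid features concatenated with the encoded positions, time, and frequency query) is only indicated schematically, and the learning-rate schedule shown is one illustrative way to decay from 5e-4 to 5e-5; the definitions in our supplementary code are authoritative.

```python
import torch
import torch.nn as nn

class NAFDecoder(nn.Module):
    """Schematic of the MLP described above: 8 fully connected layers with
    512 hidden units, LeakyReLU(0.1) activations, and 2 outputs
    (log-magnitude and instantaneous frequency). `in_dim` stands in for the
    concatenated grid features and encoded query (positions, time, frequency)."""

    def __init__(self, in_dim, hidden=512, n_layers=8, out_dim=2):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_layers - 1):
            layers += [nn.Linear(d, hidden), nn.LeakyReLU(0.1)]
            d = hidden
        layers.append(nn.Linear(d, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = NAFDecoder(in_dim=256)  # illustrative input size
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# One illustrative way to decay the learning rate from 5e-4 to 5e-5 over 200 epochs:
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.1 ** (1 / 200))
```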
> **Q2) Baseline and visualization details**

The impulse responses are indeed processed directly by the baseline encoders. This choice was motivated by our desire to have a set of impulse responses that could be applied to arbitrary sounds. It would be possible to encode the post-convolution audio; however, that would sacrifice the ability to generalize. The specific code we used to encode our data is provided in the `baselines` folder of the supplementary code. For the loudness visualization we compute the root mean square of the impulse response. Code for the visualization can be found in `testing/vis_loudness_NAF.py`.

> **Q3) Directional sounds**

In many of our qualitative demos, the emitter is placed quite far away from the listener. In cases where the emitter is not in the same room as the listener, the reverberation of the sound is more obvious, while the directional nature of the sound is less so. In cases where the listener is immediately outside of the doorway, the directional aspect should be most evident (e.g., emitter location 1 in Large 2 at around 0:24; emitter location 2 in Large 2 at around 0:10; emitter location 1 in Large 1 at around 0:32). The use of headphones may better highlight the directional effect. We agree that losses explicitly designed for maintaining directional cues are worth exploring.

We thank you for your comments, and we hope that this clarifies our results! We will update the paper to reflect your suggestions.

## Reviewer ossn

We are grateful to Reviewer ossn for the suggestions and comments. We address specific comments below.

> **Q1) Quantitative and qualitative metrics**

As part of the revision, we have additionally provided direct-to-reverberant ratio (DRR) error and interaural cross correlation coefficient (IACC) error. The former should reflect how well we model the direct sound, while the latter should reflect binaural spatialization. In addition, we performed a human evaluation where subjects were provided with headphones and asked to perform a two-alternative forced-choice task, in which over 82.38% found our NAFs to outperform the AAC-nearest baseline. We also provide qualitative samples on our project site: https://sites.google.com/view/nafs-neurips2022

> **Q2) Visualization of the spectrograms**

All our spectrograms are presented with frequency on the vertical axis and time on the horizontal axis. In Figure 3, (e)-(g) show the spectrogram of a long music sample that has been convolved. We have added axis labels and adjusted the orientation of our figure to improve clarity in the revision.

> **Q3) Societal impact**

Due to space constraints, our societal impact section was placed on the last page of our supplementary. We will add a note in our revision to indicate where this section can be found. To further clarify, the primary use case for our work lies in virtual reality and gaming. As our work can lead to more believable and higher-quality representations of spatial audio than alternative methods, it is possible that our work could increase dependency on, and time spent, gaming.

Thank you for your comments! We will address your feedback in the revision.
