# R3 Strong Accept
**Q1: Allocate more on O2 and O3?**
Please note that some of these analyses appear in the Supp due to space constraints. To recap, for O2 and O3:
- O2 - Fig 5a: audio gives similar or even better spatial cues than PointGoal displacements (which assume perfect odometry)
- O2 - Fig 5b \& L529-537: failure cases
- O2 - Supp video at 13:30: heatmap to further illustrate
- O2 - t-SNE plots (Fig 3 in Supp, L516 main): validate that the learned audio features capture both distance and angle to the goal.
- O2 - L521-8 ablation: gauge the impact of audio beyond simple audio intensity
- O2 - Sec 6 \& Tab 1 of Supp: elaborate on that ablation
- O2 - Fig 6 \& Supp video: analyze relative impact of audio and visual streams
- O3 - Table 3: generalization to new sounds
- O3 - Fig 7: effects of different sound types, testing a) same sound, b) variety of heard sounds, and c) variety of unheard sounds.
- O3 - L611-7: why wider-frequency sounds perform better
**Q2: How much do materials contribute? Seems expensive.**
The most prominent effect of materials is to change the RIR's rate of decay (reverberation time); materials can also alter the relative strength of early reflections. We devised an efficient strategy to obtain material annotations on the meshes by leveraging existing semantic segmentations (L208-15). The mapping is done easily using the material database of Egan et al. (L63-78, Supp).
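For concreteness, here is a minimal sketch of that annotation strategy, assuming a hypothetical lookup table; category names, material names, and coefficient values below are placeholders, not the actual Egan et al. database entries:

```python
# Illustrative sketch only: attach acoustic material properties to mesh faces
# by reusing their semantic labels. All names and values are placeholders.

# Hypothetical per-material acoustic properties (absorption per frequency band).
MATERIAL_DB = {
    "carpet":   {"absorption": [0.05, 0.10, 0.25, 0.45], "scattering": 0.4},
    "concrete": {"absorption": [0.01, 0.02, 0.02, 0.03], "scattering": 0.1},
    "curtain":  {"absorption": [0.10, 0.30, 0.50, 0.60], "scattering": 0.5},
}

# Hypothetical mapping from semantic-segmentation categories to materials.
SEMANTIC_TO_MATERIAL = {"floor": "carpet", "wall": "concrete", "window": "curtain"}

def annotate_mesh_materials(face_semantic_labels):
    """Return acoustic material properties for each mesh face, given its
    semantic category; unknown categories fall back to a default material."""
    return [
        MATERIAL_DB[SEMANTIC_TO_MATERIAL.get(label, "concrete")]
        for label in face_semantic_labels
    ]
```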
**Q3: Would be interesting to investigate transfer to real world.**
Agreed. Please see the noisy microphone experiment (Q3 to R2).
**Q4: Only generalization for sound sources is studied?**
Actually, we do show generalization to both new environments and new sound sources.
In all experiments, the test environments are unseen (L420,L577).
**Q5: Authors don't have to, but would be nice to compare to Gan et al. ICRA 2020.**
We implemented Gan et al.’s approach and tested it on our dataset (their code and data are unavailable). Theirs achieves an SPL of 0.576, while ours achieves 0.742 (Tab 3), i.e., ours is 29% better. In contrast to our end-to-end RL agent, Gan et al. decouple the task into predicting the goal location solely from audio and navigating to it with an analytic planner. Our simulation platform is more realistic for both visuals (computer graphics in Gan et al. vs. real-world scans in ours) and acoustics (ray tracing/sound penetration/full occlusion model in ours vs. a low-cost game audio system in Gan et al.). Finally, our simulation data has 75$\times$ more source locations and 142$\times$ more binaural audio.
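For clarity, the 29% figure is simply the relative SPL improvement computed from the numbers above (a sanity check, not an additional result):

```latex
\frac{0.742 - 0.576}{0.576} \approx 0.29 \quad (\text{about a 29\% relative gain in SPL})
```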
# R2 Weak Accept
**Q1: Recent robotics work combines different modalities, e.g., Lee et al. ICRA 2019 for vision and touch.**
Indeed, the interplay of multiple modalities has been of great interest to the vision, language, and robotics communities. We explore audio-visual learning, and hence focus our Related Work accordingly. However, we are happy to point to work combining other modalities like touch. To our knowledge, ours is the first work to explore the navigation abilities of an audio-visual agent. We generalize state-of-the-art deep RL policies within the first audio-visual 3D simulator compatible with embodied agent training.
**Q2: Does simulation support rendering multiple sources?**
Yes. Our simulation renders audio at any given receiver location $r$ for a sound emitted from any source location $s$. For multiple simultaneous sounds, one simply takes a linear combination of the per-source outputs, i.e., each source signal convolved with its RIR to the receiver and summed across sources. This allows our data to be effortlessly adopted for a variety of future work.
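A minimal sketch of this multi-source rendering, assuming precomputed binaural RIRs for each source-receiver pair (function and variable names are illustrative, not our released API):

```python
import numpy as np
from scipy.signal import fftconvolve

def render_mixture(source_waveforms, binaural_rirs):
    """Render the two-channel audio heard at a receiver from several sources.

    source_waveforms: list of 1-D mono source signals, one per source s.
    binaural_rirs:    list of (num_samples, 2) binaural RIRs from each s to r.
    """
    rendered = []
    for wav, rir in zip(source_waveforms, binaural_rirs):
        # Convolve the dry source signal with the left/right impulse responses.
        left = fftconvolve(wav, rir[:, 0])
        right = fftconvolve(wav, rir[:, 1])
        rendered.append(np.stack([left, right], axis=-1))
    # Linear acoustics: the mixture is the sum of the per-source renderings.
    mix = np.zeros((max(r.shape[0] for r in rendered), 2))
    for r in rendered:
        mix[: r.shape[0]] += r
    return mix
```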
**Q3: I like the paper overall, analysis is very interesting...How does the agent perform with a distractor sound source?**
Noise is a form of distracting audio and can come from two sources: the environment and the microphone. Environment noise includes room reflections, which our simulation already captures; reflections count as environment noise because they do not come from the direct sound source, e.g., when the goal is near a corner, audio is louder at the corner than at the goal due to reflections. Microphone noise can be measured by the SNR in decibels. Following R1's suggestion, we tested our AudioGoal (depth) model under three SNR levels: 40dB (bad microphone), 60dB (good microphone), and 80dB (perfect microphone). With the bad microphone, SPL drops from 0.742 to 0.726, whereas the good and perfect microphones do not adversely affect performance. Note: the PointGoal agent with perfect GPS only reaches SPL 0.592 (Tab 2). Hence our AudioGoal model is robust to microphone noise even without being trained with noise. In the same study comparing AudioPointGoal vs. PointGoal in the presence of both GPS noise and microphone noise, AudioPointGoal still outperforms PointGoal by a substantial margin.
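For reference, a minimal sketch of how microphone noise at a target SNR can be injected into a clean waveform (the exact noise model in our test may differ; names are illustrative):

```python
import numpy as np

def add_microphone_noise(waveform, snr_db, rng=None):
    """Add white Gaussian noise to a clean waveform at a target SNR in dB,
    e.g., snr_db=40 (bad mic), 60 (good mic), or 80 (near-perfect mic)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise
```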
Our data and platform allow simulating distractor sounds at multiple locations, which will be interesting future work. It requires expanding the method to recognize the goal sound from distractors, among other things.
# R1 Borderline Reject
**Q1: Expected to see analysis of new modality, study of learned representation, interesting findings.**
We believe our paper offers these things. **Analyze the new modality:** we explore both AudioGoal and AudioPointGoal, and audio's generalization: from a single heard sound, to variable heard sounds, to variable unheard sounds (Tab 3). **Study the learned representation:** we show the learned features encode distance and angle to the goal (L516-519 and Fig 3 t-SNE in Supp). **Interesting findings:** (1) audio alone surpasses standard GPS-based PointGoal (Tab 3); (2) audio agents fail closer to the goal than PointGoal agents (Fig 5b); (3) the AudioPointGoal (APG) agent is most robust to the choice of visual input modality (Tab 3), and with only RGB it better captures scene geometry (L482); (4) wider-bandwidth sources help navigation (Fig 7, L608); (5) our agent generalizes to both unseen environments and unheard sounds.
**Q2: Study noise in audio?**
Good idea, thanks. Please see Q3 response to R2.
**Q3: Obvious: Fig 5a shows GPS noise does not affect audio.**
Fig 5a is not simply saying that noisy GPS does not change audio. The point is to test whether audio can supplant GPS for reaching an audio goal. One of our most exciting findings is that audio naturally relaxes the perfect-odometry assumption made when goals are defined using GPS (PointGoal). PointGoal was a first step for embodied navigation [4]; current research is pushing toward ObjectGoal and RoomGoal to move beyond the perfect-GPS assumption. Our objective with AudioGoal is similar (Fig 5a).
**Q4: Non-continuous sounds (e.g., glass breaking)**
Continuous sounds are often navigation targets, e.g., a baby crying, music, people speaking, a phone ringing. Still, we agree that brief sounds are also interesting. Future work can study that case; the agent would receive only an initial audio observation and then rely solely on vision.
**Q5: Impact of audio and visual.**
Zeroing out a modality can be seen as simulating a faulty sensor. It does introduce a domain shift to the input; however, we find our model is fairly robust to this shift, and Fig 6 offers qualitative insight. Kindly note that the obvious alternative of separately training a vision-only model to study vision’s impact is not feasible, since such an agent would receive no cue at all about the goal (AudioGoal).
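To make the ablation concrete, here is a small sketch of zeroing out one input stream at test time (observation keys and function names are placeholders, not the exact names in our code):

```python
import torch

def ablate_modality(observation, drop="audio"):
    """Zero out one input stream at test time to probe its contribution,
    mimicking a faulty sensor (illustrative sketch)."""
    observation = dict(observation)
    if drop == "audio":
        observation["spectrogram"] = torch.zeros_like(observation["spectrogram"])
    elif drop == "vision":
        observation["depth"] = torch.zeros_like(observation["depth"])
    return observation
```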
**Q6: Gan et al ICRA20 as concurrent work.**
We are happy to cite it. Please see Q5 response to R3.
**Q7: Dynamic?**
It means the cue is received at every time step [56] (L269).