Thank you all for your valuable feedback. All reviewers acknowledge the novelty of developing the first audio-visual approach to dereverberation. Reviewers Tgq4, NGNE, and NeeT appreciate the value of the created synthetic dataset and the collected real dataset. Reviewers Tgq4, NeeT, and j5yj point out the solid experiments and ablations.

## Tgq4 (6: Marginally Above the Acceptance Threshold)

Thank you for the valuable feedback.

**W1: Artifacts and distortion in some examples.** In this work, we focus on investigating whether and how visuals help dereverberation for human perception (measured by PESQ) and machine perception (measured by WER and EER), and thus we perform spectrogram-level (frequency-domain) optimization, which achieves a good balance between the two. GAN-based models (MetricGAN+ and HiFi-GAN) produce audio with slightly fewer artifacts and less distortion to our ears, but lead to significantly worse WER and EER scores (Table 1). Speech enhancement is only one of the three tasks we perform, and the only one of the three where human perception of the output is relevant (i.e., not so for speech recognition and speaker identification). We will explore a subjective user study in future work, but nonetheless PESQ offers a measurable signal about the quality of our outputs relative to the baselines (W. Lin et al., Multimedia Analysis, Processing & Communications, 2011).

**W2: Using a Hann window with overlap-add instead of non-overlapping concatenation.** Thank you for the suggestion. We tried overlap-add with Hann windows on both spectrograms and time-domain signals, but neither improved performance. We speculate that our non-overlapping concatenation works better because we directly minimize the MSE loss on spectrograms.
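For concreteness, below is a minimal numpy sketch of the time-domain overlap-add variant we experimented with; the segment length, hop size, and function name are illustrative assumptions rather than our exact implementation. Our non-overlapping concatenation corresponds to setting `hop` equal to the segment length with a rectangular window.

```python
import numpy as np

def overlap_add(segments, hop):
    """Stitch equal-length waveform segments with a Hann window.

    segments: list of 1-D numpy arrays (per-chunk dereverberated outputs).
    hop: step in samples between segment starts (seg_len // 2 gives 50% overlap).
    """
    seg_len = len(segments[0])
    window = np.hanning(seg_len)
    out = np.zeros(hop * (len(segments) - 1) + seg_len)
    norm = np.zeros_like(out)  # window-sum normalization to avoid amplitude modulation
    for i, seg in enumerate(segments):
        start = i * hop
        out[start:start + seg_len] += window * seg
        norm[start:start + seg_len] += window
    return out / np.maximum(norm, 1e-8)
```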
**W3: Where does the "random speech embedding sampled from the data batch" come from?** This random speech embedding is the audio feature of a different sample in the same input batch. We have updated Section 5 in the pdf.

**W4: Marginal performance improvement by adding the visual information.** The relative improvement of VIDA over the audio-only baseline is 2% for PESQ, 10% for WER, and 15% for EER. The results are statistically significant according to a paired t-test (p-values are 1.56e-60 for PESQ, 3.70e-08 for WER, and 2.58e-43 for speaker verification scores). The "w/ random images" ablation in Table 1 has exactly the same number of parameters as VIDA, yet its performance is similar to the audio-only model; the comparison between VIDA and this ablation strongly suggests that VIDA learns acoustic information that helps dereverberation. We also validated the model's performance with a sim2real evaluation (Table 2, Table 3), which demonstrates that VIDA is not simply learning to exploit artifacts of the simulator.

We would also like to point out that WER on LibriSpeech is in the low single digits, so absolute improvements of a fraction of a percent on any of the evaluation subsets are significant. The SOTA leaderboard for LibriSpeech is captured here: https://github.com/syhw/wer_are_we, though it does not match our setting since these methods assume the input speech is clean. The "Voices from a Distance 2019" challenge compares recent work on far-field, reverberant audio-only ASR: https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1837.pdf. Among the top 4 systems (Figure 2), relative differences between adjacent systems range from 6.25% to 24.6%; we achieve an 11% relative gain over the audio-only baseline. The tasks are not identical (our method needs visual input and so is not applicable to Voices from a Distance), but this suggests that our gains are in a respectable zone for this community.

**W5: Adding trainable parameter count to Table 1.** We added the trainable parameter counts for all models and their inference times in Section A.9 due to space constraints. We would argue that the performance improvement is not due to more trainable parameters, because the "w/ random images" ablation in Table 1 has exactly the same number of parameters as VIDA yet only performs similarly to audio-only.

**W6: Adding noise with lower SNR.** In our submission, we added noise at 20 dB SNR following prior work (Ernst et al., 2018; Nakatani et al., 2010). Per the reviewer's request, to validate the model's robustness against extreme noise, we trained both audio-only and VIDA under WHAM noise at 5 dB SNR. The PESQ, WER, and EER scores for audio-only dropped from 2.33, 6.53, and 4.83 to 1.52, 26.4, and 15.61. The PESQ, WER, and EER scores for VIDA dropped from 2.37, 4.44, and 3.97 to 1.55, 25.37, and 14.16, respectively. Despite the large drop in performance due to the extreme noise, our audio-visual model still outperforms the audio-only model on all tasks.
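For reference, a minimal sketch of how noise can be mixed in at a target SNR; this is the standard procedure, with placeholder names, rather than our exact data pipeline:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech, scaled so the speech-to-noise power ratio equals snr_db."""
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# e.g., mix_at_snr(reverberant_speech, wham_noise, snr_db=5.0)
```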
**M1: Inaccurate statement regarding RIRs.** Thank you for pointing this out. We wanted to stress the relative positioning of the speaker with respect to the listener, but yes, the RIR is a function of the absolute positions of both the speaker and the listener. We have updated Section 3.

**M2: Be more specific about the source-to-microphone distances used in the real recordings.** We have updated the distance range for each scenario in Table 3.

**M3: Be more specific about the real data.** We used 10 utterances balanced between genders for each location, and we use different speakers across rooms. We have incorporated this into the text in Section 4.

**M7: Blind RT60/DRR/distance prediction accuracy.** These errors are absolute. We have updated Section A.6.

**M4/M5/M6: Other minor writing comments.** Thank you for your feedback. We have updated the pdf.

## NGNE (3: reject, not good enough)

Thank you for the valuable feedback.

**Q1: Audio-only vs. VIDA gains are modest.** We do not intend to claim that visuals can replace an audio model for dereverberation. Instead, we show that a visual model provides complementary information and leads to consistent performance gains. As for the performance difference, the relative improvement of VIDA over the audio-only baseline is 2% for PESQ, 10% for WER, and 15% for EER. The results are statistically significant according to a paired t-test (p-values are 1.56e-60 for PESQ, 3.70e-08 for WER, and 2.58e-43 for speaker verification scores). We also validated the model's performance with a sim2real evaluation (Table 2, Table 3), which demonstrates that VIDA is not simply learning to exploit artifacts of the simulator. For the significance of the WER improvement, see the discussion of LibriSpeech WER in W4 for reviewer Tgq4. Thus, our results do indeed show that visual information is helpful, contrary to the reviewer's claim.
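The paired t-test above is computed over per-utterance scores of the two models on the same test utterances. A minimal scipy sketch, with randomly generated placeholder arrays standing in for the real per-utterance metrics:

```python
import numpy as np
from scipy.stats import ttest_rel

# Placeholder scores; in practice these are the paired per-utterance PESQ
# (or WER / speaker-verification) values of VIDA and Audio-only.
rng = np.random.default_rng(0)
scores_audio_only = rng.normal(2.32, 0.4, size=2600)
scores_vida = scores_audio_only + rng.normal(0.05, 0.1, size=2600)

t_stat, p_value = ttest_rel(scores_vida, scores_audio_only)  # paired t-test
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```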
**Q2: Similar approaches have been used in audio-visual enhancement with lip information.** Unfortunately, the reviewer writes no more than this single sentence about this point, without specifying any references or elaborating on how prior work conflicts with our claims of novelty. To our knowledge, no prior work attempts audio-visual dereverberation (see the "Audio-visual learning from video" paragraph in Section 2). The visual information used by VIDA to infer reverberation is an observation of the room/environment; lip movements are not used in VIDA, though they could be added as an additional, complementary visual input.

## NeeT (5: marginally below the acceptance threshold)

Thank you for your valuable feedback.

**Q1: Not sufficient information to support that visual inputs characterize room acoustics.** This is the first work that proposes to leverage visual inputs to characterize room acoustics and help dereverberation. We show that our audio-visual approach improves dereverberation on three speech tasks compared to multiple audio-only approaches, as experimentally validated on both synthetic and real data (Table 1, Table 2). These numbers strongly suggest that the model learns acoustic information from visuals to help dereverberation. Further, we analyze how the visual factors contribute to VIDA's performance: 1) full observation of the room geometry is useful to capture more geometric information (Table 1); 2) the model leverages the distance to the speaker for understanding room acoustics ("w/o human mesh" ablation in Table 1); 3) the reverb-visual matching loss helps the model learn better visual features; 4) we analyze the learned visual features with a t-SNE projection (sketched below) and show that they correlate well with RT60 and room size (Figure 4 (c-d)).
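A minimal sketch of how such a t-SNE projection colored by RT60 can be produced; the arrays here are random placeholders standing in for the extracted visual embeddings and per-sample RT60 values:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data: (N, D) visual embeddings and per-sample RT60 values.
rng = np.random.default_rng(0)
visual_feats = rng.normal(size=(500, 512))
rt60 = rng.uniform(0.2, 1.5, size=500)

proj = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(visual_feats)
plt.scatter(proj[:, 0], proj[:, 1], c=rt60, cmap="viridis", s=5)
plt.colorbar(label="RT60 (s)")
plt.savefig("tsne_rt60.png")
```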
**Q2: Reverb matching loss, "performance was better with the audio-only model even in PESQ".** The reviewer may have seen the wrong numbers. In Table 1, VIDA without the reverb matching loss underperforms the full model but still outperforms Audio-only on all metrics (note: higher PESQ is better). Secondly, while reverb-visual matching is one interesting part of our approach design, it is not the key idea of this paper. Our key idea is to leverage visuals for dereverberation, and we investigated ways to make the model learn acoustic information better. Besides the matching loss, there are many other important factors that contribute to VIDA's performance (see the response to Q1).

**Q3: Random image has pretty good performance compared to audio-only.** Thank you for raising this point. As noted in the paper, this "w/ random image" setting was only used for testing the model (see the "Ablations" paragraph in Section 6). It is likely that because our VIDA model is better trained, it outperforms audio-only even when images are random at test time. Simply swapping the image with an incorrect image leads to a large drop on all three metrics, which indicates that our model relies on the acoustic cues in the image for dereverberation. However, taking the reviewer's feedback into account, we have now **trained** the model with random images and tested it. The results are shown in the following table. When VIDA is trained with random images as input, its performance is very similar to the audio-only baseline, as the reviewer expected. This new ablation has exactly the same number of parameters and architecture as VIDA; the only difference is the content of the image. It shows that our model reasons about the image content to capture room acoustics and help dereverberation.

| | PESQ | WER (%) | EER (%) |
| :--- | :----: | ---: | ---: |
| Audio-only | 2.32 | 4.92 | 4.67 |
| Audio-only + R-vectors | 2.23 | 5.23 | 4.82 |
| VIDA w/ random image | 2.34 | 4.94 | 4.70 |
| VIDA | **2.37** | **4.44** | **3.97** |

**Q4: Model efficiency.** We updated the trainable parameter counts of all models and their inference times in Section A.9 of the appendix due to space constraints. Despite having more parameters, VIDA runs comparably to Audio-only (an ablation of VIDA) and faster than MetricGAN+, HiFi-GAN, and WPE. We have updated Section 6 accordingly.

**Q5: R-vectors approach for comparison.** Thank you for suggesting this baseline (Khokhlov et al., INTERSPEECH 2019). R-vectors trains an acoustics-aware network in a self-supervised way and uses its feature embedding for distant speech recognition. Though we could not find an existing public implementation of the R-vectors model suggested by the reviewer, we implemented and trained it on our dataset, where the R-vectors features are concatenated with the audio features in the same way as the visual features (a schematic sketch follows below). The result for this model is shown in the table above. Adding R-vectors did not improve the performance of the Audio-only baseline. Due to space constraints, we have added this experiment and the implementation details of R-vectors in Section A.10.
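Schematically, this fusion looks like the following PyTorch sketch, where an utterance-level conditioning embedding (an R-vector or a visual feature) is tiled along time and concatenated with frame-level audio features; tensor names and dimensions are illustrative assumptions, not our exact architecture:

```python
import torch

def fuse(audio_feats: torch.Tensor, cond_embed: torch.Tensor) -> torch.Tensor:
    """Concatenate an utterance-level embedding with frame-level audio features.

    audio_feats: (batch, time, d_audio) features of the reverberant speech.
    cond_embed:  (batch, d_cond) conditioning vector (R-vector or visual embedding).
    Returns:     (batch, time, d_audio + d_cond) fused features.
    """
    tiled = cond_embed.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
    return torch.cat([audio_feats, tiled], dim=-1)

# Example shapes: (8, 200, 256) audio + (8, 128) embedding -> (8, 200, 384)
fused = fuse(torch.randn(8, 200, 256), torch.randn(8, 128))
```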
## Reviewer j5yj (8: accept, good paper)

Thank you for your valuable feedback.

**Q1: Lack of distribution of scores or practical meaning of score difference.** The results are statistically significant according to a paired t-test (p-values are 1.56e-60 for PESQ, 3.70e-08 for WER, and 2.58e-43 for speaker verification scores). Because the input speech varies drastically in reverberation and difficulty from sample to sample, there is large variance for all models. Instead, to show the spread of scores, we plot the cumulative WER vs. PESQ or distance to show how the scores vary as a function of difficulty in Figure 4 (a-b). As for the score difference, we explain in the "Results in scanned environments" paragraph in Section 6 that the pretrained ASR and SV models we use yield errors competitive with the SoTA, and thus the relative improvement is considerable.

**Q2: Why is phase prediction + GL refinement better than GL + random initialization?** Empirically, we find that predicting the phase followed by GL refinement works better than using random initialization. We hypothesize that initializing GL with the predicted phase helps convergence compared to a randomly initialized phase (a minimal sketch of this refinement appears at the end of this response). PESQ and WER degrade from 2.37 and 4.44% to 2.27 and 4.50% with random initialization. For Equation 4, we tuned the weighting factors for the phase loss and matching loss to achieve the best performance.

**Q3: Figure 5 is hard to parse.** Thank you for the suggestion. We have updated the figures.

**Q4: Geometric differences in the environments.** The distribution of WER as a function of distance is shown in Figure 4 (b). When distances become larger, the speaker is more likely to be out of view, in which case VIDA also outperforms the baselines by the largest margin. To sort the results from our submission in a way that answers the reviewer's question even more directly, the table below shows the performance of all methods when there is direct sound (left two columns) and when there is no direct sound (right two columns). There are 1,621 direct cases and 979 non-direct cases. When there is no direct signal, the inputs tend to be more reverberant and difficult. Regardless, our VIDA model consistently outperforms the baselines.

| | PESQ (direct) | WER (direct) | PESQ (no direct) | WER (no direct) |
| :--- | :----: | ---: | :----: | ---: |
| Reverberant | 1.67 | 4.38 | 1.32 | 16.71 |
| MetricGAN+ | 2.60 | 3.60 | 1.88 | 6.85 |
| HiFi-GAN | 2.09 | 4.82 | 1.40 | 16.39 |
| Audio-only | 2.58 | 3.42 | 1.89 | 7.32 |
| VIDA | **2.61** | **3.33** | **1.96** | **6.27** |
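For illustration, a minimal numpy/librosa sketch of Griffin-Lim refinement that can start from either a predicted phase or a random one; the STFT parameters and function name are illustrative assumptions, not our exact configuration:

```python
import numpy as np
import librosa

def griffin_lim(mag, init_phase=None, n_iter=32, n_fft=1024, hop=256):
    """Refine a magnitude spectrogram into a waveform with Griffin-Lim.

    mag: (1 + n_fft // 2, frames) predicted magnitude spectrogram.
    init_phase: optional phase in radians (same shape); if None, start from random phase.
    """
    if init_phase is None:
        phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))  # random initialization
    else:
        phase = np.exp(1j * init_phase)                          # predicted-phase initialization
    S = mag * phase
    for _ in range(n_iter):
        y = librosa.istft(S, hop_length=hop, win_length=n_fft)
        S = librosa.stft(y, n_fft=n_fft, hop_length=hop, win_length=n_fft)
        S = mag * np.exp(1j * np.angle(S))  # keep refined phase, re-impose predicted magnitude
    return librosa.istft(S, hop_length=hop, win_length=n_fft)
```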