<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/2203.14072) | [Code link](https://gewu-lab.github.io/MUSIC-AVQA/) | CVPR 2022
:::success
**Thoughts**
This paper introduces a new task called Audio-Visual Question Answering (AVQA).
They build a dataset with 45,867 question-answer pairs covering different modalities (audio, visual, and audio-visual) and multiple question types.
Additionally, they propose a spatio-temporal grounding model to enhance fine-grained scene understanding and reasoning.
:::
## Abstract
This study focuses on the Audio-Visual Question Answering (AVQA) task, which involves answering questions about visual objects, sounds, and their associations in videos.
They introduce the large-scale **MUSIC-AVQA dataset**, containing over 45K question-answer pairs built from 33 question templates spanning different modalities and question types.
They also develop several baselines and propose a **spatio-temporal grounded audio-visual network** for the AVQA task.
## Background
This study introduces a new task called **Audio-Visual Question Answering (AVQA)**, which focuses on answering questions about visual objects, sounds, and their associations.
The image below illustrates a sample case for Audio-Visual Question Answering, which requires both auditory and visual modalities for multimodal scene understanding and spatio-temporal reasoning.

## Method
### MUSIC-AVQA Dataset
To explore scene understanding and spatio-temporal reasoning across audio and visual modalities, they created a large-scale dataset, MUSIC-AVQA, focusing on the question-answering task.
Recognizing the value of high-quality datasets for AVQA research, they manually collected musical performance videos from YouTube, selecting 22 instruments like guitar, cello, and xylophone.
They designed 9 question types (instantiated as 33 templates) covering three scenarios: audio, visual, and audio-visual.
The table below compares the MUSIC-AVQA dataset with other video QA datasets.

The figure provides statistics for the MUSIC-AVQA dataset.

### Spatio-temporal grounded audio-visual network
For the input video, both visual and audio sequences are divided into $T$ non-overlapping 1-second segments $\{ V_t, A_t \}_{t=1}^T$.
The question $Q$ is tokenized into $N$ words $\{ q_n \}_{n=1}^N$.
Three encoders are then used: VGGish for audio, ResNet-18 for visual, and LSTM for the question.
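
A minimal sketch of how such per-segment encoders could be wired up in PyTorch (the class name, dimensions, and the use of precomputed 128-d VGGish embeddings are assumptions for illustration, not the authors' exact implementation):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SegmentEncoders(nn.Module):
    """Per-segment encoders: ResNet-18 (visual), VGGish features (audio), LSTM (question)."""
    def __init__(self, vocab_size, word_dim=512, hidden_dim=512):
        super().__init__()
        # Visual: ResNet-18 backbone; keep the 512-d pooled feature per frame.
        backbone = resnet18(weights="IMAGENET1K_V1")
        self.visual = nn.Sequential(*list(backbone.children())[:-1])
        # Question: word embedding + LSTM; use the last hidden state as the query.
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        # Audio: project precomputed 128-d VGGish embeddings to the common size.
        self.audio_proj = nn.Linear(128, hidden_dim)

    def forward(self, frames, vggish_feats, question_ids):
        # frames: (B*T, 3, 224, 224), vggish_feats: (B, T, 128), question_ids: (B, N)
        v = self.visual(frames).flatten(1)      # (B*T, 512) per-segment visual feature
        a = self.audio_proj(vggish_feats)       # (B, T, hidden_dim) audio features
        q_emb = self.embed(question_ids)        # (B, N, word_dim) word embeddings
        _, (h, _) = self.lstm(q_emb)            # h: (1, B, hidden_dim)
        return v, a, h.squeeze(0)               # question feature: (B, hidden_dim)
```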

They associate specific visual locations with input sounds for spatial grounding, and use the question query to highlight audio and visual features at key timestamps for temporal grounding.
Finally, multimodal fusion integrates audio, visual, and question information to predict the answer.
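
The overall flow can be approximated by the hedged PyTorch sketch below (the attention forms, dimensions, and fusion MLP are illustrative assumptions, not the authors' exact architecture):

```python
import torch
import torch.nn as nn

class SpatioTemporalGrounding(nn.Module):
    """Audio-guided spatial attention, question-guided temporal attention, then fusion."""
    def __init__(self, dim=512, num_answers=42):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.Tanh(), nn.Linear(dim, num_answers)
        )

    def forward(self, vis_maps, audio, question):
        # vis_maps: (B, T, HW, D) per-segment visual feature maps
        # audio:    (B, T, D)     per-segment audio features
        # question: (B, D)        sentence-level question feature
        # Spatial grounding: attend over HW locations with the audio as query.
        att_s = torch.softmax(
            torch.einsum("btld,btd->btl", vis_maps, audio) / vis_maps.size(-1) ** 0.5,
            dim=-1,
        )                                                       # (B, T, HW)
        vis = torch.einsum("btl,btld->btd", att_s, vis_maps)    # (B, T, D)
        # Temporal grounding: attend over T segments with the question as query.
        att_v = torch.softmax(torch.einsum("btd,bd->bt", vis, question), dim=-1)
        att_a = torch.softmax(torch.einsum("btd,bd->bt", audio, question), dim=-1)
        vis_t = torch.einsum("bt,btd->bd", att_v, vis)          # (B, D)
        aud_t = torch.einsum("bt,btd->bd", att_a, audio)        # (B, D)
        # Fusion: combine audio, visual, and question, then classify the answer.
        logits = self.fusion(torch.cat([aud_t, vis_t, question], dim=-1))
        return logits                                           # (B, num_answers)
```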
## Experiment
To validate their method on the MUSIC-AVQA dataset, they compare it with recent audio QA methods.
They use answer prediction accuracy as the metric and evaluate model performance on various question types.
The answer vocabulary includes 42 options (22 objects, 12 counting choices, 6 location types, and yes/no).
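
With a fixed vocabulary, answer prediction reduces to a 42-way classification problem, and accuracy can be reported per question type. A small illustrative sketch (the type strings and the bucketing helper are assumptions, not the paper's evaluation code):

```python
from collections import defaultdict
import torch

def per_type_accuracy(logits, target_ids, question_types):
    # logits: (B, 42) classifier outputs over the 42-option answer vocabulary
    # target_ids: (B,) ground-truth answer indices
    # question_types: list of B type strings, e.g. "Audio Counting" (hypothetical labels)
    correct = logits.argmax(dim=-1) == target_ids
    buckets = defaultdict(list)
    for ok, q_type in zip(correct.tolist(), question_types):
        buckets[q_type].append(ok)
    # Per-type accuracy plus the overall accuracy across all samples
    acc = {t: sum(v) / len(v) for t, v in buckets.items()}
    acc["Overall"] = correct.float().mean().item()
    return acc
```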

The spatio-temporal grounding results are visualized as follows: in (a-e), the sounding regions and key timestamps are highlighted from the spatial and temporal perspectives. This demonstrates that their method effectively models spatio-temporal associations across modalities, improving scene understanding and reasoning.
Subfigure (f) shows a failure case where complex scenarios with multiple sounds and silent objects hinder accurate object-sound correlation, resulting in a wrong answer.
