###### tags: `draft` `thesis` `jptw`
# Chapter 4 Experiments and Results
## 4.1 Dataset
|Sample rate (kHz)|Bit rate (kbps)|Channels|Format|
|---|---|---|---|
|48.0|320|stereo|WAV|
**Table 4.1**: The recording settings.
In recent years, population aging has become a pressing issue worldwide, especially in Japan and Taiwan. Aging can affect all of the senses, but hearing and vision are usually the most affected. For older adults with age-related hearing loss (ARHL), devices such as hearing aids can help improve their ability to hear. In this thesis, however, we approach the problem from a different angle: sound design and audio engineering. We therefore collected field recordings in public places in Japan and Taiwan, focusing on designed sounds such as the beeping signals at crosswalks or the approaching music on railway platforms. Since our experiments were conducted on these recordings, this section discusses the creation and usage of the dataset.
Our dataset includes 9 hours of field recordings (15 seconds to 50 minutes each), collected from 9 categories of acoustic environment (see the list in Appendix A.1); the distribution of recording places is shown in Fig. 4.1. All recordings were captured in a binaural setup with the settings listed in Table 4.1, using a [SCENES surround binaural recording earphone](http://www.scenessound.com/product/life-like.html) and an iPhone 6.

**Figure 4.1**: The statistical distribution of places in Japan and Taiwan in our recording dataset.
The resulting audio files naturally included repeated occurrences of sound events in the categories listed in Appendix A.1. However, establishing ground truth for them is problematic, because manually labeling these long-duration recordings is tedious and error-prone. We therefore deliberately inserted specific sound events to provide known repeated events for algorithm testing. To narrow the scope of the research to a manageable size, we limited the examined sound events to music recorded in the subway. As shown in Table 4.2, the approaching music of Taipei Metro was introduced throughout the recordings, and we designed different conditions to simulate these events in noisy and reverberant environments. A single channel of the stereo signal was downsampled to 44.1 kHz and processed. The inserted sound events are separated by intervals of silence (5 to 15 seconds each), and their levels are set 5 to 20 dB below the background recordings, depending on the condition.

**Table 4.2**: The experiment dataset design. Four songs of approaching music used on different lines of Taipei Metro were introduced throughout the recordings, and different scenarios and event sound levels were designed to simulate real-life environments.
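The event-insertion conditions described above can be reproduced programmatically. Below is a minimal sketch, assuming NumPy arrays of mono samples; `mix_event` is a hypothetical helper (not part of the thesis code) that scales an event so its RMS level sits a given number of dB (e.g. -5 to -20) below the background before adding it at a chosen onset.

```python
import numpy as np

def mix_event(background, event, onset, level_db, sr=44100):
    """Insert `event` into a copy of `background` at `onset` seconds,
    attenuated so its RMS sits `level_db` dB relative to the background RMS
    (negative values place the event below the background)."""
    bg = background.copy()
    start = int(onset * sr)
    end = start + len(event)
    rms_bg = np.sqrt(np.mean(bg ** 2))
    rms_ev = np.sqrt(np.mean(event ** 2))
    # gain that places the event level_db dB relative to the background
    gain = (rms_bg / rms_ev) * 10 ** (level_db / 20)
    bg[start:end] += gain * event
    return bg
```

Repeating this for each song, scenario, and level offset yields a fully scripted dataset whose onsets and offsets are known exactly, which is what makes the ground truth reliable.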
:::info
- The resulting audio files naturally included repeated occurrences of all the sound events in the chosen categories in a variety of environments
- A single channel of the resulting stereo signal was downsampled to 11.025 kHz and processed
:::
> Ref:
> Recorder:
> [1] [Scenes 2 in 1 Surround Binaural Microphone/Headphone](https://www.amazon.ca/Scenes-Microphone-Recording-Headphone-Reproduce/dp/B06Y2DPQ7Z)
> [2] [Scenes | 3D surround-sound recording earphones (iPhone bundle)](https://udesign.udnfunlife.com/mall/cus/cat/Cc1c02.do?dc_cateid_0=0_059_058_353&dc_cargxuid_0=DU00053064)
> [3] [SCENES LIFELIKE VR RECORDING EARPHONE](http://www.scenessound.com/product/life-like.html)
## 4.2 Experiment Design
:::info
- Server
- Database
- Query
- Method
:::
First, we briefly introduce the development setup of the QBEAT system. The system is built with Python and the Django web framework, following a REST API architecture. A database is also required, and MariaDB is chosen as the database server.
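As a sketch of this setup, a Django project pointed at MariaDB uses Django's MySQL backend in its settings module; all database names and credentials below are placeholders, and the `mysqlclient` driver is assumed to be installed.

```python
# settings.py (sketch) -- names and credentials below are placeholders
DATABASES = {
    "default": {
        # MariaDB is accessed through Django's MySQL backend
        "ENGINE": "django.db.backends.mysql",
        "NAME": "qbeat",
        "USER": "qbeat_user",
        "PASSWORD": "change-me",
        "HOST": "127.0.0.1",
        "PORT": "3306",
    }
}
```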
For algorithm testing, a set of experiments was conducted to evaluate the performance of audio fingerprinting (AF). In the experiment procedure, we used SQLite as a lightweight database to simulate the behavior of retrieving information from the database. The audio dataset is the one described above; since the deliberately inserted sound events were carefully scripted and documented, they provide ground truth (GT) for the experiments. The proposed system uses audio fingerprinting as its main method, and we compare it with a cross-correlation method in terms of retrieval performance and retrieval time.
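The cross-correlation baseline can be sketched as follows; `best_offset` is a hypothetical helper (not the thesis implementation) that slides a query over a recording with NumPy's `correlate` and reports the best-matching sample offset.

```python
import numpy as np

def best_offset(recording, query):
    """Baseline matcher: slide the query over the recording and return
    the offset (in samples) with the highest cross-correlation against
    the mean-removed, normalized query."""
    q = (query - query.mean()) / (np.std(query) + 1e-12)
    corr = np.correlate(recording, q, mode="valid")
    return int(np.argmax(corr))
```

Unlike fingerprinting, this scan touches every sample of every recording, which is why retrieval time is one of the comparison axes.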
In the evaluation of performance, the balanced **_F1-score_** was used:
$$ F_1=2\cdot \frac{precision \cdot recall}{precision+recall} $$
where
$$ precision=\frac{TP}{TP+FP} $$
and
$$ recall=\frac{TP}{TP+FN} $$
Inserted sound events are bounded by onset and offset times, as are the detected regions. The entries of the confusion matrix are defined as follows:
|Definition|
|---|
|- **_true positive_ (TP)**: event occurs and the region is detected|
|- **_false positive_ (FP)**: no event occurs but the region is detected|
|- **_false negative_ (FN)**: event occurs but no region is detected|
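The three counts above determine the score directly. A minimal sketch (the helper name `f1_score` is illustrative), guarding against empty denominators:

```python
def f1_score(tp, fp, fn):
    """Balanced F1-score from true-positive, false-positive and
    false-negative counts, as defined above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```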
==Another measure used to illustrate retrieval performance is mean average precision (MAP), as provided by `[11] mesaros2013query`. This is a single measure of performance over ranked results, avoiding the trade-off between precision and recall. The average precision for a query is calculated from the precision at each relevant audio file.
$$ AP=\frac{1}{R}\sum_{n=1}^{R} \frac{n}{rank_n} $$
where $R$ is the number of relevant audio files and $rank_n$ is the rank at which the $n$-th relevant file appears in the ranked result list. The average precision is calculated for each query, and the mean average precision over all queries evaluates the overall performance of the system.==
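Following the formula above, a small sketch of AP and MAP; `relevant_ranks` is assumed to hold the 1-based ranks at which the relevant files were retrieved for one query.

```python
def average_precision(relevant_ranks):
    """AP = (1/R) * sum_{n=1}^{R} n / rank_n, where relevant_ranks are the
    1-based ranks of the R relevant files in the result list."""
    R = len(relevant_ranks)
    return sum(n / rank for n, rank in
               enumerate(sorted(relevant_ranks), start=1)) / R

def mean_average_precision(rank_lists):
    """MAP: average precision averaged over all queries."""
    return sum(average_precision(r) for r in rank_lists) / len(rank_lists)
```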
## 4.3 Experiment Analysis
:::info
Tuning: threshold, params
:::
### 4.3.1 Audio Fingerprinting
As discussed in Sec. 3.2, audio fingerprinting can be separated into two stages: fingerprinting and searching. In the fingerprinting stage, several parameters affect the retrieval results: the size of the target zone, the total number of paired fingerprints, and the minimum amplitude of spectral peaks. However, the size of the target zone and the number of pairs are correlated; for instance, enlarging the target zone is pointless while the number of pairs stays the same. Thus, the following subsections focus on the influence of the total number of pairs and the minimum peak amplitude, while the size of the target zone is fixed, with $\Delta t$ and $\Delta f$ set to 5 s and 1200 Hz respectively.
|Symbol|Description|
|:---:|---|
|$n$ | total number of paired fingerprints in a target zone|
|$v$ | threshold of peak value in a spectrogram|
|$d$ | variable subscript for audio dataset|
|$q$ | variable subscript for audio query|
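The pairing step can be sketched as follows, using the fixed target zone ($\Delta t$ = 5 s, $\Delta f$ = 1200 Hz) and the pairing limit $n$ from the table above; `pair_peaks` is a hypothetical helper operating on lists of (time, frequency) spectral peaks.

```python
def pair_peaks(peaks, dt=5.0, df=1200.0, n=5):
    """peaks: list of (time_s, freq_hz) spectral peaks, sorted by time.
    Pair each anchor peak with up to `n` later peaks inside its target
    zone (within dt seconds and df Hz), yielding fingerprint tuples
    (f_anchor, f_target, time_delta, anchor_time)."""
    pairs = []
    for i, (t1, f1) in enumerate(peaks):
        count = 0
        for t2, f2 in peaks[i + 1:]:
            if t2 - t1 > dt:          # past the target zone in time
                break
            if abs(f2 - f1) <= df:    # inside the target zone in frequency
                pairs.append((f1, f2, round(t2 - t1, 3), t1))
                count += 1
                if count == n:        # pairing limit reached for this anchor
                    break
    return pairs
```

With this structure, growing $\Delta t$ or $\Delta f$ only helps if $n$ grows with it, which is the correlation between zone size and pair count noted above.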
#### Threshold of peak finding
$$
X(\omega, n_0) = \sum_{n=0}^{N-1} w[n]\,x[n+n_0]\,e^{-j\omega n}
$$
$$
P_x = 10\log_{10}\left(\frac{1}{T}\,\lvert X_x \rvert^2\right)
$$
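The two equations above can be sketched in code: compute the windowed DFT of each frame, convert bin power to dB, and keep only the time-frequency bins whose power exceeds the threshold $v$. The frame length, hop size, and helper names below are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def power_spectrogram_db(x, n_fft=512, hop=256):
    """Power of each windowed DFT bin in dB, following the equations above
    with T equal to the window length."""
    w = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        X = np.fft.rfft(w * x[start:start + n_fft])
        frames.append(10 * np.log10(np.abs(X) ** 2 / n_fft + 1e-12))
    return np.array(frames)  # shape: (num_frames, num_freq_bins)

def find_peaks(S_db, v=-40.0):
    """Return (frame, bin) indices whose power exceeds the threshold v (dB)."""
    return np.argwhere(S_db > v)
```

Raising $v$ keeps only the strongest spectral peaks, trading robustness to noise against the number of fingerprints generated.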
