###### tags: `v1` `thesis` `jptw`

# Chapter 4 Experiments and Results

## 4.1 Dataset

|Sample rate (kHz)|Bit rate (kbps)|Channels|Format|
|---|---|---|---|
|48.0|320|stereo|WAV|

**Table 4.1**: The recording settings.

In recent years, population aging has become a pressing issue worldwide, especially in Japan and Taiwan. Aging can affect all of the senses, but hearing and vision are usually the most affected. For older adults with age-related hearing loss (ARHL), devices such as hearing aids can help improve their ability to hear. In this thesis, we approach the problem from a different angle: sound design and audio engineering. We collected field recordings in public places in Japan and Taiwan, focusing on designed sounds such as the beeping signals at crosswalks or the approaching music on railway platforms. We conducted several experiments with these recordings; we first discuss the construction and the use of our dataset.

Our dataset contains 9 hours of field recordings (15 seconds to 50 minutes each). The recordings cover 9 categories of acoustic environments (see the list in Appendix A.1), and the distribution of recording places is shown in Fig. 4.1. All recordings were captured with a binaural setup using the settings listed in Table 4.1; the recording equipment consists of a [SCENES surround binaural recording earphone](http://www.scenessound.com/product/life-like.html) and an iPhone 6.

![recording-numbers](https://i.imgur.com/8EkiDSR.png =80%x)

**Figure 4.1**: The distribution of recording places in Japan and Taiwan in our dataset.

These audio files contain repeated occurrences of sound events in the categories listed in Appendix A.1. However, it is difficult to establish ground truth, since manually labeling these long-duration recordings is tedious and error-prone. We therefore deliberately inserted specific sound events to provide known repeated events for algorithm testing. As shown in Table 4.2, the approaching music of the Taipei Metro was introduced throughout the recordings, and we designed different conditions to simulate noisy and reverberant environments. A single channel of the stereo signal was downsampled to 44.1 kHz and processed. The inserted sound events are separated by intervals of silence (5 to 15 seconds each), and their levels are set 5 to 20 dB below the original recordings (i.e., -5 to -20 dB), depending on the condition.

![exp-design](https://i.imgur.com/RKzrZv2.png =60%x)

**Table 4.2**: The experiment dataset design. Four songs of approaching music used on different lines of the Taipei Metro were introduced throughout the recordings, and different scenarios and sound levels of events were implemented to simulate real-life environments.

## 4.2 Experiment Design

First, we briefly introduce the development setup of the QBEAT system. The system is built with Python and the Django web framework, and its interface follows the REST API architecture. A database is also required, and MariaDB was chosen as the database server. For algorithm testing, a set of experiments was conducted to evaluate the performance of AF. During testing, we used SQLite as a lightweight database to simulate the behavior of retrieving information from the database. The audio dataset is described above; since the deliberately inserted sound events were carefully scripted and documented, they provide ground truth (GT) for the experiments.
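The event-insertion step used to build the experiment dataset can be summarized in a short sketch. The snippet below is a minimal illustration only, assuming `librosa` and `soundfile` are available and that events are mixed additively into the field recording; the file names, insertion position, and attenuation value are hypothetical placeholders rather than the exact script used for this thesis.

```python
# Sketch: mix a known sound event into one channel of a field recording
# at a reduced level, after downsampling to 44.1 kHz.
# File names and parameter values are illustrative assumptions.
import librosa
import soundfile as sf

SR = 44100  # working sample rate after downsampling

def insert_event(recording, event, position_s, level_db):
    """Mix `event` into `recording` starting at `position_s` seconds,
    attenuated by `level_db` dB (e.g. -5 to -20 dB)."""
    out = recording.copy()
    start = int(position_s * SR)
    gain = 10.0 ** (level_db / 20.0)
    end = min(start + len(event), len(out))
    out[start:end] += gain * event[: end - start]
    return out

# Take one channel of the 48 kHz stereo recording and resample to 44.1 kHz.
stereo, sr = librosa.load("field_recording.wav", sr=None, mono=False)
mono = librosa.resample(stereo[0], orig_sr=sr, target_sr=SR)

event, _ = librosa.load("taipei_metro_blue_line.wav", sr=SR, mono=True)
mixed = insert_event(mono, event, position_s=20.0, level_db=-10.0)
sf.write("experiment_clip.wav", mixed, SR)
```

Because each insertion position and level is scripted this way, the exact extent of every event is known and can serve directly as ground truth.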
We implemented AF as the main method and compared it with cross correlation (CC) in terms of querying time and retrieval performance. For the predictive analysis, we define the elements of the confusion matrix as follows:

|Definition|
|---|
|- **_true positive_ (TP)**: an event occurs and the region is detected|
|- **_false positive_ (FP)**: no event occurs but a region is detected|
|- **_false negative_ (FN)**: an event occurs but no region is detected|

**Table 4.3**: The definition of the confusion-matrix elements for sound event detection.

To evaluate the performance, the balanced **_F1-score_** was used:

$$
F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}
$$

where

$$
precision = \frac{TP}{TP+FP}
$$

and

$$
recall = \frac{TP}{TP+FN}
$$

However, there may be some mismatch between the region boundaries of the ground truth and the prediction, as shown in Fig. 4.2. In Fig. 4.2 (b), the region boundaries predicted by AF are more precise than those of CC. To take region boundaries into account, we split each region into 1-second segments and decide for each segment whether the prediction is a TP, FP, or FN. In this case, the segment-based *F1-score* for Fig. 4.2 (b) is 0.89 for AF and 0.52 for CC.

![](https://i.imgur.com/3OCA40h.png)
(a)
![](https://i.imgur.com/zcoK5Jd.png)
(b)

**Fig. 4.2**: The ground truth and the predicted regions for the event "Taipei Metro blue line music". The grey range denotes where the sound event is inserted; the red ranges are the regions detected by AF, and the green ranges are the regions detected by CC. In (b), the dashed lines show the segments of the region.

For the evaluation of overall performance, we consider the segment-based *F1-score* and the event-based *recall* for the two methods, AF and CC. We conducted a set of experiments on all relevant audio files with different audio queries, and the results are presented in the next section.
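As a concrete illustration of the segment-based scoring described above, the sketch below discretizes ground-truth and predicted regions into 1-second segments before counting TP, FP, and FN. The region lists, function names, and example values are hypothetical; this is a minimal sketch of the idea, not the exact evaluation script used in the experiments.

```python
# Sketch: segment-based F1-score. Regions are (start, end) pairs in
# seconds; each 1-second segment is marked active if any region overlaps it.
def to_segments(regions, duration_s, seg_s=1.0):
    n = int(duration_s // seg_s) + 1
    active = [False] * n
    for start, end in regions:
        for i in range(int(start // seg_s), int(end // seg_s) + 1):
            if i < n:
                active[i] = True
    return active

def segment_f1(gt_regions, pred_regions, duration_s):
    gt = to_segments(gt_regions, duration_s)
    pred = to_segments(pred_regions, duration_s)
    tp = sum(g and p for g, p in zip(gt, pred))
    fp = sum((not g) and p for g, p in zip(gt, pred))
    fn = sum(g and (not p) for g, p in zip(gt, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: one inserted event vs. a detection that overshoots its boundaries.
print(segment_f1(gt_regions=[(20.0, 35.0)], pred_regions=[(18.0, 40.0)], duration_s=60.0))
```

Counting per segment rather than per region makes the score sensitive to boundary errors such as the overshoot in the example call, which is why AF and CC receive different segment-based F1-scores for the same detected event in Fig. 4.2 (b).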